Opened 4 months ago

Last modified 4 weeks ago

#28229 new defect

Possible race condition opening SOCKS Port in test_rebind.py

Reported by: teor Owned by: rl1987
Priority: High Milestone: Tor: 0.3.5.x-final
Component: Core Tor/Tor Version: Tor: 0.3.5.1-alpha
Severity: Normal Keywords: tor-ci, 035-can
Cc: Actual Points:
Parent ID: Points:
Reviewer: teor Sponsor:

Description

test_rebind.py failed in master because the SOCKSPort wasn't open when the python script tried to connect:
https://travis-ci.org/torproject/tor/builds/448008717

When we merge #27968, we'll get better debugging. But we should also print all the tor logs to the console, particularly when there is a failure.

And we should print the test_rebind.py logs with timestamps.

Child Tickets

Change History (21)

comment:1 Changed 4 months ago by rl1987

Owner: set to rl1987
Status: newaccepted

Will try to look into this in next few days.

comment:2 Changed 4 months ago by rl1987

Patch for diagnostics: https://github.com/torproject/tor/pull/473

This should help with digging up the root cause of the failure, if it happens again.

comment:3 Changed 4 months ago by teor

Status: acceptedneeds_review

We should review and merge this patch, then leave the ticket open for further analysis.

comment:4 Changed 3 months ago by dgoulet

Reviewer: catalyst

comment:5 in reply to:  2 Changed 3 months ago by catalyst

Status: needs_reviewmerge_ready

Replying to rl1987:

Patch for diagnostics: https://github.com/torproject/tor/pull/473

This should help with digging up the root cause of the failure, if it happens again.

Thanks for the patch! I left a minor comment on the pull request, which you may address if you like (timestamp redundancy).

comment:6 Changed 3 months ago by nickm

Keywords: 035-can added

comment:7 Changed 3 months ago by teor

Status: merge_readyneeds_revision

The logging needs millisecond resolution: second resolution is not enough to reassemble the script and tor logs in the correct order.

I left some detailed notes on the pull request.

comment:8 Changed 3 months ago by rl1987

Status: needs_revisionneeds_review

comment:9 Changed 3 months ago by catalyst

Reviewer: catalystteor

Reviewer to teor because they made the most recent round of review feedback. Please feel free to toss back to me if necessary.

comment:10 Changed 3 months ago by teor

Status: needs_reviewmerge_ready

This change seems fine to me.

comment:11 Changed 3 months ago by nickm

Status: merge_readyassigned

I've squashed and rebased the branch as bug28229_diag_035, and merged it to 0.3.5.

Since this is a diagnostic patch, I assume I should not actually close this issue. setting it to new again.

comment:12 Changed 3 months ago by nickm

Status: assignednew

comment:13 Changed 3 months ago by nickm

I think this build log might have an example of the failure: https://travis-ci.org/torproject/tor/jobs/462336253

comment:14 Changed 3 months ago by rl1987

Some notes about the above CI log:

  • test_rebind.py fails to connect to SOCKS listener after rebinding - it succeeds connecting the first time.
  • There's no log messages about closing the 127.0.0.1:13790 listener, or the new listener being replaced by new one. Weird.

comment:15 Changed 3 months ago by rl1987

One more pull request for diagnostics: https://github.com/torproject/tor/pull/560

  • This time, let's use getsockopt() to check if new listener is actually listening.
  • Let's log some stuff when socket rebinding is supposed to happen.
  • Let's log the proper error message when socket.connect_ex() fails on Python side.

comment:16 Changed 3 months ago by rl1987

Status: newneeds_review

comment:17 Changed 3 months ago by teor

Status: needs_reviewmerge_ready

Looks good to me, but it's a diagnostic, so we shouldn't close the ticket.

comment:18 Changed 2 months ago by nickm

Status: merge_readyassigned

Merged to master, not closed.

comment:19 Changed 2 months ago by nickm

Status: assignednew

comment:20 Changed 4 weeks ago by rl1987

https://travis-ci.org/rl1987/tor/jobs/481321719

Hmmm....

2019-01-18 11:58:35.059 Tor logged: "Jan 18 11:58:35.059 [info] set_max_file_descriptors: Raising max file descriptors from 10208 to 9223372036854775807.", waiting for "Opened Socks listener"
2019-01-18 11:58:35.059 Tor logged: "Jan 18 11:58:35.059 [info] set_max_file_descriptors: Dropped connection limit to 10208 based on OPEN_MAX (10240); Apparently, 9223372036854775807 was too high and rlimit lied to us.", waiting for "Opened Socks listener"
2019-01-18 11:58:35.059 Tor logged: "Jan 18 11:58:35.059 [warn] You specified a public address '0.0.0.0:58579' for SocksPort. Other people on the Internet might find your computer and use it as an open proxy. Please don't allow this unless you have a good reason.", waiting for "Opened Socks listener"
2019-01-18 11:58:35.060 Tor logged: "Jan 18 11:58:35.059 [debug] retry_all_listeners: 1 replacements - starting rebinding loop.", waiting for "Opened Socks listener"
2019-01-18 11:58:35.060 Tor logged: "Jan 18 11:58:35.060 [notice] Opening Socks listener on 0.0.0.0:58579", waiting for "Opened Socks listener"
2019-01-18 11:58:35.060 Tor logged: "Jan 18 11:58:35.060 [debug] connection_add_impl: new conn type Socks listener, socket 7, address 0.0.0.0, n_conns 4.", waiting for "Opened Socks listener"
2019-01-18 11:58:35.060 Tor logged: "Jan 18 11:58:35.060 [debug] connection_listener_new: Socks listener listening on port 58579.", waiting for "Opened Socks listener"
2019-01-18 11:58:35.060 Tor logged: "Jan 18 11:58:35.060 [notice] Opened Socks listener on 0.0.0.0:58579"
2019-01-18 11:58:35.060 FAIL
Cannot connect to SOCKSPort: error Connection reset by peer

setrlimit() failed to raise number of open file descriptors we can have. Note that it tries to raise it to LLONG_MAX (http://pubs.opengroup.org/onlinepubs/009604499/basedefs/limits.h.html), which is suspicious. Perhaps that causes listener failure to happen? Perhaps we should sanity check rlim.rlim_max that we get from getrlimit() before trying to use it?

comment:21 Changed 4 weeks ago by teor

I thought we already checked rlim_max, particularly on Apple.
But maybe some of our recent changes have skipped that check?

Note: See TracTickets for help on using tickets.