Opened 6 years ago

Closed 5 years ago

#11069 closed defect (fixed)

Tor with unreachable PTs should not hang the bootstrap process

Reported by: asn Owned by:
Priority: Medium Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: 024-backport tor-pt tor-client, tbb-needs, tbb-tor-backported-3.6b1
Cc: mcs, brade, intrigeri@… Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

We got reports that when Tor is used with PTs and the PTs are unreachable, the bootstrap process does not error out like it does for normal bridges. Instead, it hangs and waits about, which makes it harder for software like tor-launcher to distinguish between connection failure and connection delays.

We should verify those reports, and make sure that PT connections also timeout after a while.

Child Tickets

Attachments (1)

bug11069_idea.patch (2.5 KB) - added by asn 6 years ago.

Download all attachments as: .zip

Change History (29)

comment:1 Changed 6 years ago by arma

Giving us some steps to reproduce would be a great first start. I assume it's a case of Tor not realizing it needs to output a bootstrap status event.

comment:3 Changed 6 years ago by nickm

To be clear, do we mean that the local proxy is running but unreachable, the bridge is unreachable by the proxy, or some third possibility?

comment:4 Changed 6 years ago by mikeperry

Cc: mcs brade added

mcs,brade: do you have reproducible instructions? Does this issue happen with plain bridges too? Normal Tor connections fail bootstrap properly and timely (in under a minute), in my testing. ASN reports that he doesn't see log messages for 5 minutes for an unreachable PT bridge with just a tor client, but we're not sure if there is an additional control port issue beyond the timeout discrepancy.

comment:5 Changed 6 years ago by brade

It isn't specific to PT. If you enter a bad bridge (as the only bridge), Tor Launcher will wait indefinitely for tor to complete the bootstrap process.

Here is a torrc to use for testing (adjust paths and make sure nothing is listening on TCP port 9532:

Bridge 127.0.0.1:9532
DataDirectory /Users/foo/TBB.app/Data/Tor
DirReqStatistics 0
GeoIPFile /Users/foo/TBB.app/Data/Tor/geoip
UseBridges 1

Here are the bootstrap status event messages that Tor Launcher received:

650 NOTICE Bootstrapped 5%: Connecting to directory server. 
650 STATUS_CLIENT NOTICE BOOTSTRAP PROGRESS=5 TAG=conn_dir SUMMARY="Connecting to directory server"
650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=5 TAG=conn_dir SUMMARY="Connecting to directory server" WARNING="Connection refused" REASON=CONNECTREFUSED COUNT=1 RECOMMENDATION=ignore

comment:6 Changed 6 years ago by asn

Hm, maybe

650 STATUS_CLIENT WARN BOOTSTRAP PROGRESS=5 TAG=conn_dir SUMMARY="Connecting to directory server" WARNING="Connection refused" REASON=CONNECTREFUSED COUNT=1 RECOMMENDATION=ignore

is the relevant bootstrap status message that we are missing? Is the problem that it's RECOMMENDATION=ignore?

Looking at control_event_bootstrap_problem() there seems to be a block of code:

  if (get_options()->UseBridges &&
      !any_bridge_descriptors_known() &&
      !any_pending_bridge_descriptor_fetches())
    recommendation = "warn";

which seems to be relevant to our interests. I did a few tests and it seems that any_pending_bridge_descriptor_fetches() always returns true, so the recommendation is never set to warn.

Could it be that we always have a bridge descriptor fetching connection open? Or that we call control_event_bootstrap_problem() before closing that connection? Or something like that?

Changed 6 years ago by asn

Attachment: bug11069_idea.patch added

comment:7 Changed 6 years ago by asn

We discussed this with nickm today, and it seems that any_pending_bridge_descriptor_fetches() is the problem. The function goes through the connection list and searches for open directory connections, and it seems that Tor opens directory connections for bridges which get closed after their OR connections close. So at the time of the function call the directory connections are still open.

Nick suggested cycling through the OR connections instead and checking if there are any active ones. Apparently the OR conns are the ones that get marked as closed first if the bridges are not reachable (not sure why).

I attached a patch that tries this idea and seems to work. Unfortunately, I don't really understand how these directory/OR connections are started up, so the patch might as well be wrong. I tested it by adding two unreachable bridges, and seeing if I get two warnings: instead I got one, which is what we wanted. I also tried adding two unreachable bridges and one reachable, and I got no warning, which is what we wanted. So from a behavior PoV it seems to work, but I'm not at all sure it's the correct fix.

Note that I moved connection_or_notify_error() after connection_mark_for_close_internal() otherwise the new any_pending_bridge_descriptor_fetches() was finding itself and returning 1. Maybe, instead of moving connection_or_notify_error() we should check that we don't find ourselves?

Also are there any other attributes of the OR connection that we should look for when searching for bridge OR connections?

BTW, it's worth mentioning that if the TBB uses this message as a hook that Tor cannot bootstrap, we should make sure that it only appears in that occasion. For example, if we have 11 bridges, and the first 10 bridges fail but the last one is legit, this message will still appear as a warning because of:

  if (bootstrap_problems >= BOOTSTRAP_PROBLEM_THRESHOLD)
    recommendation = "warn";

comment:8 Changed 6 years ago by asn

(I will try to look more into this on Friday.)

comment:9 Changed 6 years ago by asn

OK, I did some digging and found out that if you start Tor with bridges, Tor will start OR connections for each bridge. It does so because of fetch_bridge_descriptors() -> launch_direct_bridge_descriptor_fetch() -> ... -> circuit_launch_by_extend_info().

In this light, checking if 'we are using bridges && no bridge descriptors && we are the last OR connection' in control_event_bootstrap_problem() kind of makes sense as a fix. Maybe we should make a function called we_are_last_or_conn(const or_connection_t *conn) that cycles through the connection list, and sees if there are any active OR conns apart from conn? Then use that function instead of any_pending_bridge_descriptor_fetches()?

Otherwise, we could mark the current connection as closed before stepping into control_event_bootstrap_problem() and then checking if there are any active OR connections left.

Also, is there anything that would create extra OR connections during bootstrap apart from bridges? We should make sure that there is not, because with the current proposed fix we would consider those connection as connections to bridges and not raise the warning.

Nick?

comment:10 in reply to:  9 Changed 6 years ago by nickm

Replying to asn:

OK, I did some digging and found out that if you start Tor with bridges, Tor will start OR connections for each bridge. It does so because of fetch_bridge_descriptors() -> launch_direct_bridge_descriptor_fetch() -> ... -> circuit_launch_by_extend_info().

In this light, checking if 'we are using bridges && no bridge descriptors && we are the last OR connection' in control_event_bootstrap_problem() kind of makes sense as a fix. Maybe we should make a function called we_are_last_or_conn(const or_connection_t *conn) that cycles through the connection list, and sees if there are any active OR conns apart from conn? Then use that function instead of any_pending_bridge_descriptor_fetches()?

Otherwise, we could mark the current connection as closed before stepping into control_event_bootstrap_problem() and then checking if there are any active OR connections left.

I like this solution best.

Also, is there anything that would create extra OR connections during bootstrap apart from bridges? We should make sure that there is not, because with the current proposed fix we would consider those connection as connections to bridges and not raise the warning.

If bridges are in use, we are not supposed to be connecting anywhere that isn't a bridge.

comment:11 Changed 6 years ago by asn

Status: newneeds_review

OK. See branch bug11069 in https://git.torproject.org/user/asn/tor.git.

After some testing it seems to work fine.

A thing I'm not so sure about is that I moved connection_or_notify_error() after the connection has been marked for closing. There doesn't seem to be anything wrong with that, but that function also calls channel_close_for_error() and I'm not sure if the channel subsystem cares about whether the connection has been closed or not.

Also, are there any other attributes of the connection that we should be checking in any_active_or_conns()? We could check whether its state is OR_CONN_STATE_CONNECTING for example, but I'm not sure if this is a good idea.

comment:12 Changed 6 years ago by intrigeri

Cc: intrigeri@… added

FWIW, this might be a related to https://trac.torproject.org/projects/tor/ticket/1247.

comment:13 Changed 6 years ago by mikeperry

Keywords: tbb-needs added

comment:14 in reply to:  11 Changed 6 years ago by andrea

Replying to asn:

OK. See branch bug11069 in https://git.torproject.org/user/asn/tor.git.

After some testing it seems to work fine.

A thing I'm not so sure about is that I moved connection_or_notify_error() after the connection has been marked for closing. There doesn't seem to be anything wrong with that, but that function also calls channel_close_for_error() and I'm not sure if the channel subsystem cares about whether the connection has been closed or not.

Also, are there any other attributes of the connection that we should be checking in any_active_or_conns()? We could check whether its state is OR_CONN_STATE_CONNECTING for example, but I'm not sure if this is a good idea.

Bad things will happen if the connection gets closed without channels being notified properly because then there will be a channel_t hanging around with a bad pointer to an or_connection_t that doesn't exist any more. You have to ensure that you're calling channel_close_for_error() or one of the other close-from-below functions when you close an or_connection_t.

It looks like you're doing that and changing the order, with connection_mark_for_close_internal() before channel_close_for_error() - I *believe* this is safe *as long as you're certain* that channel_close_for_error() happens before close_closeable_connections() gets to that orconn. Is it really necessary to structure it like that though, different from the order of all the other calls to channel_close_for_error() in connection.c? If so, perhaps an explanatory comment would be advisable.

comment:15 Changed 6 years ago by asn

I moved the connection_or_notify_error() below connection_mark_for_close_internal() because I wanted the connection to be already closed when the new any_active_orconns() function gets called. Otherwise, any_active_orconns() would consider the current connection (that is about to be closed) as an active connection.

We could avoid this problem by making any_active_orconns() smarter and passing it a connection as an argument and making it consider that connection as closed (maybe rename it to any_other_active_orconns(or_connection_t *conn)). I suggested this in comment:9 and nick was not very enthusiastic, which makes sense, since the solution is not elegant.

Unfortunately, the rest of the solutions I can think of (that don't require big refactoring) are not very elegant either. Solutions involve restoring the previous order of the function but calling control_event_bootstrap_problem() after we've closed the connection. But this would call control_event_bootstrap_problem() twice, except if we modified the code so that it doesn't happen.

Other solutions involve moding connection_or_notify_error() to decouple connection_or_connect_failed() from channel_close_for_error(). But this would require too much modification in other parts of the code.

Any other ideas?

comment:16 Changed 6 years ago by nickm

I suppose that on an 0.2.5 timeframe, any_other_active_orconns() is not _too_ horrible. At least, it has a very low risk of breaking other tricky code. :/

comment:17 in reply to:  16 ; Changed 6 years ago by asn

Replying to nickm:

I suppose that on an 0.2.5 timeframe, any_other_active_orconns() is not _too_ horrible. At least, it has a very low risk of breaking other tricky code. :/

OK. How do we feel about bug11069_take2 in my repo?

I made an entirely new branch for this. If you want to compare it with the previous one, try git diff bug11069..bug11069_take2.

comment:18 in reply to:  17 Changed 6 years ago by andrea

Replying to asn:

Replying to nickm:

I suppose that on an 0.2.5 timeframe, any_other_active_orconns() is not _too_ horrible. At least, it has a very low risk of breaking other tricky code. :/

OK. How do we feel about bug11069_take2 in my repo?

I made an entirely new branch for this. If you want to compare it with the previous one, try git diff bug11069..bug11069_take2.

any_other_active_orconns() seems acceptable to me for now; it makes me a lot less nervous than calling connection_mark_for_close_internal() while a channel still thinks the connection is open does.

comment:19 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-finalTor: 0.2.4.x-final

Merging to 0.2.5. if it doesn't explode anything, I gather that Mike wants a backport to 0.2.4. That might be nontrivial; I tried to cherry-pick it and got a bunch of conflicts. Trying again...

comment:20 Changed 6 years ago by nickm

Okay, "bug11069_024" in my public is a backport candidate. asn, could you please have a look at it for sanity?

comment:21 in reply to:  20 Changed 6 years ago by asn

Replying to nickm:

Okay, "bug11069_024" in my public is a backport candidate. asn, could you please have a look at it for sanity?

Looks good to me.

comment:22 Changed 6 years ago by andrea

Keywords: 024-backport added

comment:23 Changed 5 years ago by mikeperry

FYI: We have been applying the nickm/bug11069_024 version of this patch (67c70b2566fc9bef4527fb8a0c24ce7d8c4d0647) to the tor used by TBB since 3.6-beta-1. As far I know it has not introduced any other issues.

comment:24 Changed 5 years ago by mikeperry

Keywords: tbb-tor-backported-3.6b1 added

comment:25 Changed 5 years ago by nickm

Recommendation: no backport.

comment:26 Changed 5 years ago by nickm

(persuade arma that this is causing huge problems in 0.2.4 if I'm wrong.)

comment:27 Changed 5 years ago by arma

I'm fine with 'no backport'. Tor 0.2.5.x will be out soon enough, and the TBB people seem to have the issue under control for themselves with their upcoming TBB 3.6.

comment:28 Changed 5 years ago by nickm

Milestone: Tor: 0.2.4.x-finalTor: 0.2.5.x-final
Resolution: fixed
Status: needs_reviewclosed

Okay, no backport to 0.2.4 for these, for stability reasons.

Note: See TracTickets for help on using tickets.