We got reports that when Tor is used with PTs and the PTs are unreachable, the bootstrap process does not error out the way it does for normal bridges. Instead, it hangs indefinitely, which makes it harder for software like tor-launcher to distinguish a connection failure from a mere connection delay.
We should verify those reports, and make sure that PT connections also time out after a while.
mcs,brade: do you have reproducible instructions? Does this issue happen with plain bridges too? In my testing, normal Tor connections fail bootstrap properly and in a timely manner (under a minute). asn reports that he doesn't see log messages for 5 minutes for an unreachable PT bridge with a plain tor client, but we're not sure whether there is an additional control port issue beyond the timeout discrepancy.
It isn't specific to PT. If you enter a bad bridge (as the only bridge), Tor Launcher will wait indefinitely for tor to complete the bootstrap process.
Here is a torrc to use for testing (adjust paths, and make sure nothing is listening on TCP port 9532):
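The attached torrc itself is not preserved in this copy of the thread. A minimal reconstruction matching the description (hypothetical values throughout; only TCP port 9532 comes from the comment above) might look like:

```
# Hypothetical test torrc: a single bridge pointing at a port where
# nothing is listening, so the bootstrap attempt cannot succeed.
UseBridges 1
Bridge 127.0.0.1:9532
DataDirectory /path/to/test/datadir
SocksPort auto
Log notice stdout
```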
Is that the relevant bootstrap status message that we are missing? Is the problem that it's RECOMMENDATION=ignore?
Looking at control_event_bootstrap_problem() there seems to be a block of code:
    if (get_options()->UseBridges &&
        !any_bridge_descriptors_known() &&
        !any_pending_bridge_descriptor_fetches())
      recommendation = "warn";
which seems to be relevant to our interests. I did a few tests and it seems that any_pending_bridge_descriptor_fetches() always returns true, so the recommendation is never set to warn.
Could it be that we always have a bridge descriptor fetching connection open? Or that we call control_event_bootstrap_problem() before closing that connection? Or something like that?
We discussed this with nickm today, and it seems that any_pending_bridge_descriptor_fetches() is the problem. The function goes through the connection list and searches for open directory connections, and it seems that Tor opens directory connections for bridges which get closed after their OR connections close. So at the time of the function call the directory connections are still open.
Nick suggested cycling through the OR connections instead and checking if there are any active ones. Apparently the OR conns are the ones that get marked as closed first if the bridges are not reachable (not sure why).
I attached a patch that tries this idea and seems to work. Unfortunately, I don't really understand how these directory/OR connections are started up, so the patch may well be wrong. I tested it by adding two unreachable bridges and checking whether I get two warnings: instead I got one, which is what we wanted. I also tried adding two unreachable bridges and one reachable one, and I got no warning, which is also what we wanted. So from a behavior PoV it seems to work, but I'm not at all sure it's the correct fix.
Note that I moved connection_or_notify_error() after connection_mark_for_close_internal(); otherwise the new any_pending_bridge_descriptor_fetches() was finding the current connection itself and returning 1. Maybe, instead of moving connection_or_notify_error(), we should check that we don't find ourselves?
Also, are there any other attributes of the OR connection that we should look for when searching for bridge OR connections?
BTW, it's worth mentioning that if TBB uses this message as a signal that Tor cannot bootstrap, we should make sure that it only appears in that situation. For example, if we have 11 bridges, and the first 10 bridges fail but the last one is legit, this message will still appear as a warning because of:
    if (bootstrap_problems >= BOOTSTRAP_PROBLEM_THRESHOLD)
      recommendation = "warn";
OK, I did some digging and found out that if you start Tor with bridges, Tor will start OR connections for each bridge. It does so because of fetch_bridge_descriptors() -> launch_direct_bridge_descriptor_fetch() -> ... -> circuit_launch_by_extend_info().
In this light, checking if 'we are using bridges && no bridge descriptors && we are the last OR connection' in control_event_bootstrap_problem() kind of makes sense as a fix. Maybe we should make a function called we_are_last_or_conn(const or_connection_t *conn) that cycles through the connection list, and sees if there are any active OR conns apart from conn? Then use that function instead of any_pending_bridge_descriptor_fetches()?
Otherwise, we could mark the current connection as closed before stepping into control_event_bootstrap_problem() and then checking if there are any active OR connections left.
Also, is there anything that would create extra OR connections during bootstrap apart from bridges? We should make sure that there is not, because with the current proposed fix we would consider those connection as connections to bridges and not raise the warning.
> Otherwise, we could mark the current connection as closed before stepping into control_event_bootstrap_problem() and then checking if there are any active OR connections left.

I like this solution best.

> Also, is there anything that would create extra OR connections during bootstrap apart from bridges?

If bridges are in use, we are not supposed to be connecting anywhere that isn't a bridge.
OK. See branch bug11069 in https://git.torproject.org/user/asn/tor.git.
After some testing it seems to work fine.
A thing I'm not so sure about is that I moved connection_or_notify_error() after the connection has been marked for closing. There doesn't seem to be anything wrong with that, but that function also calls channel_close_for_error() and I'm not sure if the channel subsystem cares about whether the connection has been closed or not.
Also, are there any other attributes of the connection that we should be checking in any_active_or_conns()? We could check whether its state is OR_CONN_STATE_CONNECTING for example, but I'm not sure if this is a good idea.
> A thing I'm not so sure about is that I moved connection_or_notify_error() after the connection has been marked for closing.
Bad things will happen if the connection gets closed without channels being notified properly because then there will be a channel_t hanging around with a bad pointer to an or_connection_t that doesn't exist any more. You have to ensure that you're calling channel_close_for_error() or one of the other close-from-below functions when you close an or_connection_t.
It looks like you're doing that and changing the order, with connection_mark_for_close_internal() before channel_close_for_error() - I believe this is safe as long as you're certain that channel_close_for_error() happens before close_closeable_connections() gets to that orconn. Is it really necessary to structure it like that though, different from the order of all the other calls to channel_close_for_error() in connection.c? If so, perhaps an explanatory comment would be advisable.
I moved the connection_or_notify_error() below connection_mark_for_close_internal() because I wanted the connection to be already closed when the new any_active_orconns() function gets called. Otherwise, any_active_orconns() would consider the current connection (that is about to be closed) as an active connection.
We could avoid this problem by making any_active_orconns() smarter and passing it a connection as an argument and making it consider that connection as closed (maybe rename it to any_other_active_orconns(or_connection_t *conn)). I suggested this in comment:9 and nick was not very enthusiastic, which makes sense, since the solution is not elegant.
Unfortunately, the rest of the solutions I can think of (that don't require big refactoring) are not very elegant either. One of them is restoring the previous order in the function but calling control_event_bootstrap_problem() after we've closed the connection; however, that would call control_event_bootstrap_problem() twice, unless we modified the code so that it doesn't happen.
Other solutions involve modifying connection_or_notify_error() to decouple connection_or_connect_failed() from channel_close_for_error(). But this would require too much modification in other parts of the code.
I suppose that on an 0.2.5 timeframe, any_other_active_orconns() is not too horrible. At least, it has a very low risk of breaking other tricky code. :/
OK. How do we feel about bug11069_take2 in my repo?
I made an entirely new branch for this. If you want to compare it with the previous one, try git diff bug11069..bug11069_take2.
any_other_active_orconns() seems acceptable to me for now; it makes me a lot less nervous than calling connection_mark_for_close_internal() while a channel still thinks the connection is open.
Merging to 0.2.5. If it doesn't explode anything, I gather that Mike wants a backport to 0.2.4. That might be nontrivial; I tried to cherry-pick it and got a bunch of conflicts. Trying again...
Trac: Milestone: Tor: 0.2.5.x-final to Tor: 0.2.4.x-final
FYI: We have been applying the nickm/bug11069_024 version of this patch (67c70b2566fc9bef4527fb8a0c24ce7d8c4d0647) to the tor used by TBB since 3.6-beta-1. As far as I know it has not introduced any other issues.
I'm fine with 'no backport'. Tor 0.2.5.x will be out soon enough, and the TBB people seem to have the issue under control for themselves with their upcoming TBB 3.6.