Yawning and I have both noticed that tor can become unresponsive if either normal Tor bridges or pluggable-transport (PT) bridges are configured and the client suffers a loss of network connectivity. After a sustained connectivity loss, all of the orconns end up closed, and Tor will not try to reconnect to its bridges, even when new stream attempts arrive.
Is it possible that Tor is simply marking all of its bridges as down in this case, and then not trying to reconnect when connectivity returns because it still believes they are down?
The only way to recover is either to send "SIGNAL HUP" via the control port, or to run kill -HUP $(pidof tor). After receiving the HUP signal, tor immediately launches new orconns and circuits for its bridges, and attaches the currently pending streams to these new circuits.
Sometimes, after this problem has happened once, tor will cease building circuits even if the network remains available.
This is extremely bad for usability, because TBB becomes completely unusable in this case, and the only thing a normal user can do is exit the whole browser and re-launch it.
This may also indicate a deeper bug with how Tor handles the liveness/'down' status of normal Guard nodes, and may cause Tor to rotate Guards more frequently than necessary.
Btw, I first started noticing this problem in recent 0.2.4.x releases, but I also switched to using bridges around that timeframe for testing purposes, so I am not fully sure if this is a regression or if this issue has always been there.
Can anybody give me step-by-step instructions on reproducing this, ideally on a laptop where I can easily turn the wireless on and off? I just tried doing it in the way that seemed obvious to me, but apparently it wasn't as obvious as all that.
oops.go is a pluggable transport that reproduces the refusal to reconnect: it drops its first -n connections after -t seconds, and after that it works as a pass-through dummy transport. oops-torrc is a configuration for it.
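The attached oops.go is not reproduced here, but the core trick is small. A minimal sketch of the drop policy, under one plausible reading of the description (the first n connections are relayed normally and then severed after the timeout, simulating a connectivity loss mid-session; the flag names and defaults are hypothetical stand-ins for the real -n/-t):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync/atomic"
	"time"
)

// count tracks how many connections have been accepted so far.
var count int64

// isDoomed reports whether connection number c (1-based) is among the
// first n that should be severed after the grace period.
func isDoomed(c, n int64) bool {
	return c <= n
}

// handle relays client<->upstream. The first n connections are cut
// after grace, simulating a network-connectivity loss; all later
// connections are a plain pass-through.
func handle(client net.Conn, upstream string, n int64, grace time.Duration) {
	server, err := net.Dial("tcp", upstream)
	if err != nil {
		client.Close()
		return
	}
	if isDoomed(atomic.AddInt64(&count, 1), n) {
		time.AfterFunc(grace, func() {
			client.Close() // sever both sides after the timeout
			server.Close()
		})
	}
	go func() { io.Copy(server, client); server.Close() }()
	io.Copy(client, server)
	client.Close()
}

func main() {
	// Demo of the drop policy; the real transport would Accept() in a
	// loop and call handle() for each incoming connection.
	fmt.Println(isDoomed(1, 3), isDoomed(4, 3)) // → true false
}
```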
Apr 23 20:38:52.000 [info] circuit_build_failed(): Our circuit died before the first hop with no connection
Apr 23 20:38:52.000 [info] entry_guard_register_connect_status(): Unable to connect to entry guard '3VXRyxz67OeRoqHn' (86FA348B038B6A04F2F50135BF84BB74EF63485B). Marking as unreachable.
Apr 23 20:38:53.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We have no usable consensus.
Apr 23 20:39:00.000 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as guard
Apr 23 20:39:00.000 [info] smartlist_choose_node_by_bandwidth(): Empty routerlist passed in to old node selection for rule weight as guard
Apr 23 20:39:00.000 [info] should_delay_dir_fetches(): Delaying dir fetches (no running bridges known)
Apr 23 20:39:00.000 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as guard
Apr 23 20:39:00.000 [info] smartlist_choose_node_by_bandwidth(): Empty routerlist passed in to old node selection for rule weight as guard
Apr 23 20:39:00.000 [info] should_delay_dir_fetches(): Delaying dir fetches (no running bridges known)
...on and on...
You might have to play with the timeout and number of connections dropped. Try deleting datadir if it doesn't work the first time.
tor seems to hang indefinitely if a disconnection happens during the fetching of descriptors (above 50% bootstrapped). A HUP will get it to make some more progress. I can also stimulate it into trying again by making a new SOCKS connection (curl --socks5-hostname localhost:9099 http://www.example.com). I haven't figured out how to reproduce the issue where tor doesn't reconnect even though it gets new SOCKS connections, though I too have seen that before. It might have to do with the "Tried for 120 seconds to get a connection" error from #10993 (moved).
Yikes; given the information above, it looks like this is going to be one with a lot of variables to experiment with, especially if provoking the bug is nondeterministic. The current plan is to try to reproduce this a bit for 0.2.5, see if it turns out to be something simple to figure out and fix, and if not, defer till 0.2.6.
From what I've seen, that's a plausible explanation.
From IRC discussion:
armadev | it's easy to trigger. just do any of the things that causes tor to mark the relay as not running. then tor won't try to connect to it.
armadev | a fix might be to mark all your bridges up if you have bridges, they're all down, and you get a new stream request
armadev | another fix might be to not mark your last bridge as down unless you really mean it
armadev | i like that one more
armadev | but we also want to stop tor from thrashing if its network is actually down (meaning all its bridges really are unreachable)
armadev | i think we'd make some good progress if we distinguished "there was a network error attempting to establish the tcp connection" from "i gave up on the circuit because it had been a while, but i did have a tcp connection"
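The first fix armadev suggests can be expressed as a simple policy check. A hypothetical sketch (not Tor's actual data structures or API, just the decision logic in isolation):

```go
package main

import "fmt"

// Bridge is a hypothetical, minimal stand-in for tor's per-bridge state.
type Bridge struct {
	Addr string
	Up   bool
}

// retryAllBridges implements the suggested policy: if bridges are
// configured, every one of them is currently marked down, and a new
// stream request has just arrived, optimistically mark them all up
// again so that connection attempts resume instead of failing fast.
func retryAllBridges(bridges []*Bridge, newStreamRequest bool) bool {
	if len(bridges) == 0 || !newStreamRequest {
		return false
	}
	for _, b := range bridges {
		if b.Up {
			return false // at least one bridge is still considered usable
		}
	}
	for _, b := range bridges {
		b.Up = true // retry everything
	}
	return true
}

func main() {
	bridges := []*Bridge{{Addr: "192.0.2.1:443"}, {Addr: "192.0.2.2:443"}}
	fmt.Println(retryAllBridges(bridges, true)) // → true
	fmt.Println(bridges[0].Up)                  // → true
}
```

Note this sketch deliberately ignores armadev's thrashing concern: a real fix would also rate-limit the retries when the network genuinely is down.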
#3259 (moved) is the non-PT case, but recent discussion has centered on PTs, and arma notes that there is probably a lot of overlap.
For every PT apart from meek (and possibly flashproxy), checking conn->proxy_state != PROXY_OK would be sufficient to establish whether the upstream connection was made. A few PT implementations also send back the appropriate SOCKSv5 error codes (Host Unreachable / Network Unreachable), but the tor side currently uses that information only for logging.
meek loses out here because it blindly sends back a SOCKSv4 success response right after receiving the request.
Yawning + others: I spent some time trying to track down whatever I'm experiencing and made some progress. When it happens to me, my control port tells me that all of the guards are live (via 'getinfo entry-guards'), but the info logs indicate that no guards are available for use in a circuit, and that an empty list is being passed into node_sl_choose_by_bandwidth() and subfunctions.
I created a branch with some additional info logs in mikeperry/bug11301-logs. That branch is off of maint-0.2.4 with all of the TBB/PT backport patches applied, as well as logging instrumentation in choose_random_entry_impl() (which is where the empty guard list is passed to node_sl_choose_by_bandwidth()). I'm still waiting for it to happen again. It seems to happen only intermittently (like once every few days), but when it does happen, it keeps happening until I completely restart Tor.
Figured I'd post my logging branch while I wait for another instance of the issue in case anyone else wants to join in on the hunt.