Yawning and I have both noticed that tor can become unresponsive if either normal Tor bridges or pluggable-transport (PT) bridges are configured and the client suffers a loss of network connectivity. After a sustained connectivity loss, all of the orconns end up closed, and Tor will not try to reconnect to its bridges, even when new stream attempts arrive.
Is it possible that Tor is simply marking all of its bridges as down in this case, and then not trying to reconnect when connectivity returns because it still believes they are down?
The only way to recover is either to send "SIGNAL HUP" via the control port, or to run kill -HUP $(pidof tor). After receiving the HUP signal, tor immediately launches new orconns and circuits for its bridges, and attaches the currently pending streams to these new circuits.
Sometimes, after this problem has happened once, tor will cease building circuits even if the network remains available.
This is extremely bad for usability, because TBB becomes completely unusable in this case, and the only thing a normal user can do is exit the whole browser and re-launch it.
This may also indicate a deeper bug with how Tor handles the liveness/'down' status of normal Guard nodes, and may cause Tor to rotate Guards more frequently than necessary.
Btw, I first started noticing this problem in recent 0.2.4.x releases, but I also switched to using bridges around that timeframe for testing purposes, so I am not fully sure if this is a regression or if this issue has always been there.
Can anybody give me step-by-step instructions on reproducing this, ideally on a laptop where I can easily turn the wireless on and off? I just tried doing it in the way that seemed obvious to me, but apparently it wasn't as obvious as all that.
oops.go is a pluggable transport that reproduces the refusal to reconnect: it drops its first -n connections after -t seconds, and after that it works as a pass-through dummy transport. oops-torrc is a configuration for it.
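The attached oops.go is not reproduced here, but the core trick is small. A minimal sketch of the drop policy, under one plausible reading of the description (the first n connections are relayed normally and then severed after the timeout, simulating a connectivity loss mid-session; the flag names and defaults are hypothetical stand-ins for the real -n/-t):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync/atomic"
	"time"
)

// count tracks how many connections have been accepted so far.
var count int64

// isDoomed reports whether connection number c (1-based) is among the
// first n that should be severed after the grace period.
func isDoomed(c, n int64) bool {
	return c <= n
}

// handle relays client<->upstream. The first n connections are cut
// after grace, simulating a network-connectivity loss; all later
// connections are a plain pass-through.
func handle(client net.Conn, upstream string, n int64, grace time.Duration) {
	server, err := net.Dial("tcp", upstream)
	if err != nil {
		client.Close()
		return
	}
	if isDoomed(atomic.AddInt64(&count, 1), n) {
		time.AfterFunc(grace, func() {
			client.Close() // sever both sides after the timeout
			server.Close()
		})
	}
	go func() { io.Copy(server, client); server.Close() }()
	io.Copy(client, server)
	client.Close()
}

func main() {
	// Demo of the drop policy; the real transport would Accept() in a
	// loop and call handle() for each incoming connection.
	fmt.Println(isDoomed(1, 3), isDoomed(4, 3)) // → true false
}
```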
Apr 23 20:38:52.000 [info] circuit_build_failed(): Our circuit died before the first hop with no connection
Apr 23 20:38:52.000 [info] entry_guard_register_connect_status(): Unable to connect to entry guard '3VXRyxz67OeRoqHn' (86FA348B038B6A04F2F50135BF84BB74EF63485B). Marking as unreachable.
Apr 23 20:38:53.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We have no usable consensus.
Apr 23 20:39:00.000 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as guard
Apr 23 20:39:00.000 [info] smartlist_choose_node_by_bandwidth(): Empty routerlist passed in to old node selection for rule weight as guard
Apr 23 20:39:00.000 [info] should_delay_dir_fetches(): Delaying dir fetches (no running bridges known)
Apr 23 20:39:00.000 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as guard
Apr 23 20:39:00.000 [info] smartlist_choose_node_by_bandwidth(): Empty routerlist passed in to old node selection for rule weight as guard
Apr 23 20:39:00.000 [info] should_delay_dir_fetches(): Delaying dir fetches (no running bridges known)
...on and on...
You might have to play with the timeout and number of connections dropped. Try deleting datadir if it doesn't work the first time.
tor seems to hang indefinitely if a disconnection happens during the fetching of descriptors (above 50% bootstrapped). A HUP will get it to make some more progress. I can also stimulate it into trying again by making a new SOCKS connection (curl --socks5-hostname localhost:9099 http://www.example.com). I haven't figured out how to reproduce the issue where tor doesn't reconnect even though it gets new SOCKS connections, though I too have seen that before. It might have to do with the "Tried for 120 seconds to get a connection" error from #10993 (moved).
Yikes; given the information above, it looks like this is going to be one with a lot of variables to experiment with, especially if provoking the bug is nondeterministic. The current plan is to try to reproduce this a bit for 0.2.5, see if it turns out to be something simple to figure out and fix, and if not, defer till 0.2.6.
From what I've seen, that's a plausible explanation.
From IRC discussion:
armadev | it's easy to trigger. just do any of the things that causes tor to mark the relay as not running. then tor won't try to connect to it.
armadev | a fix might be to mark all your bridges up if you have bridges, they're all down, and you get a new stream request
armadev | another fix might be to not mark your last bridge as down unless you really mean it
armadev | i like that one more
armadev | but we also want to stop tor from thrashing if its network is actually down (meaning all its bridges really are unreachable)
armadev | i think we'd make some good progress if we distinguished "there was a network error attempting to establish the tcp connection" from "i gave up on the circuit because it had been a while, but i did have a tcp connection"
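The first fix armadev suggests can be expressed as a simple policy check. A hypothetical sketch (not Tor's actual data structures or API, just the decision logic in isolation):

```go
package main

import "fmt"

// Bridge is a hypothetical, minimal stand-in for tor's per-bridge state.
type Bridge struct {
	Addr string
	Up   bool
}

// retryAllBridges implements the suggested policy: if bridges are
// configured, every one of them is currently marked down, and a new
// stream request has just arrived, optimistically mark them all up
// again so that connection attempts resume instead of failing fast.
func retryAllBridges(bridges []*Bridge, newStreamRequest bool) bool {
	if len(bridges) == 0 || !newStreamRequest {
		return false
	}
	for _, b := range bridges {
		if b.Up {
			return false // at least one bridge is still considered usable
		}
	}
	for _, b := range bridges {
		b.Up = true // retry everything
	}
	return true
}

func main() {
	bridges := []*Bridge{{Addr: "192.0.2.1:443"}, {Addr: "192.0.2.2:443"}}
	fmt.Println(retryAllBridges(bridges, true)) // → true
	fmt.Println(bridges[0].Up)                  // → true
}
```

Note this sketch deliberately ignores armadev's thrashing concern: a real fix would also rate-limit the retries when the network genuinely is down.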
#3259 (moved) is the non-PT case, but recent discussion has centered on PTs, and arma notes that there is probably a lot of overlap.
For every PT apart from meek (and possibly flashproxy), checking conn->proxy_state != PROXY_OK would be sufficient to establish whether the upstream connection was made. A few PT implementations also send back the appropriate SOCKSv5 error codes (Host Unreachable / Network Unreachable), but the tor side currently uses that information only for logging.
meek loses out here because it blindly sends back a SOCKSv4 success response right after receiving the request.
Yawning + others: I spent some time trying to track down whatever I'm experiencing and made some progress. When it happens to me, my control port tells me that all of the guards are live (via 'getinfo entry-guards'), but the info logs indicate that no guards are available for use in a circuit, and that an empty list is being passed into node_sl_choose_by_bandwidth() and subfunctions.
I created a branch with some additional info logs in mikeperry/bug11301-logs. That branch is off of maint-0.2.4 with all of the TBB/PT backport patches applied, as well as logging instrumentation in choose_random_entry_impl() (which is where the empty guard list is passed to node_sl_choose_by_bandwidth()). I'm still waiting for it to happen again. It seems to happen only intermittently (like once every few days), but when it does happen, it keeps happening until I completely restart Tor.
Figured I'd post my logging branch while I wait for another instance of the issue in case anyone else wants to join in on the hunt.