Opened 4 years ago

Last modified 6 months ago

#11301 new defect

Tor does not reconnect after network loss with guards used as bridges

Reported by: mikeperry Owned by: nickm
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-bridges, tor-client, tbb-usability, flashproxy, sponsor8-maybe
Cc: yawning, athena, isis Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Yawning and I have both noticed that tor can become unresponsive if either normal tor bridges or PT bridges are configured, and the client suffers a network connectivity loss. After sustained network connectivity loss, all of the orconns end up closed, and Tor will not try to reconnect to its bridges, even when new stream attempts arrive.

Is it possible that Tor is simply marking all of its bridges down in this case, and then not trying to reconnect to them when network connectivity returns because it thinks they are still down?

The only way to recover from this issue is to either send "SIGNAL HUP" to the control port, or to run kill -HUP $(pidof tor). After receiving the HUP signal, tor immediately launches new orconns and circuits for its bridges, and attaches the currently pending streams to these new circuits.
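
For anyone who wants to script the control-port workaround, here is a minimal sketch. It assumes the control port listens on 127.0.0.1:9051 and accepts either no authentication or password authentication; cookie authentication is not handled. The function names are hypothetical, and only the AUTHENTICATE, SIGNAL HUP and QUIT control commands are used.

```python
import socket


def build_hup_commands(password=""):
    """Return the control-port lines that authenticate and then request a HUP."""
    # Backslashes and double quotes must be escaped inside the quoted string.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return [
        'AUTHENTICATE "%s"\r\n' % escaped,
        "SIGNAL HUP\r\n",
        "QUIT\r\n",
    ]


def send_hup(host="127.0.0.1", port=9051, password=""):
    """Send the commands to a running tor and return one reply line per command.

    The host, port and password defaults are assumptions about the local
    setup, not values taken from this ticket.
    """
    replies = []
    with socket.create_connection((host, port), timeout=10) as sock:
        reader = sock.makefile("r", encoding="ascii", newline="\r\n")
        for command in build_hup_commands(password):
            sock.sendall(command.encode("ascii"))
            replies.append(reader.readline().strip())
    return replies
```

Calling send_hup() against a live control port should return one status line per command; a reply that does not start with 250 means that step was rejected.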

Sometimes, after this problem has happened once, tor will cease building circuits even if the network remains available.

This is extremely bad for usability, because TBB becomes completely unusable in this case, and the only thing a normal user can do is exit the whole browser and re-launch it.

This may also indicate a deeper bug with how Tor handles the liveness/'down' status of normal Guard nodes, and may cause Tor to rotate Guards more frequently than necessary.

Attachments (2)

oops.go (3.5 KB) - added by dcf 4 years ago.
pluggable transport that drops its first few connections
oops-torrc (164 bytes) - added by dcf 4 years ago.

Change History (30)

comment:1 Changed 4 years ago by mikeperry

Btw, I first started noticing this problem in recent 0.2.4.x releases, but I also switched to using bridges around that timeframe for testing purposes, so I am not fully sure if this is a regression or if this issue has always been there.

comment:2 Changed 4 years ago by mikeperry

May be a dup of #10993?

comment:3 Changed 4 years ago by nickm

Keywords: 024-backport added
Milestone: Tor: 0.2.5.x-final

comment:4 Changed 4 years ago by nickm

Can anybody give me step-by-step instructions on reproducing this, ideally on a laptop where I can easily turn the wireless on and off? I just tried doing it in the way that seemed obvious to me, but apparently it wasn't as obvious as all that.

comment:5 Changed 4 years ago by nickm

Keywords: 025-triaged added

comment:6 Changed 4 years ago by dcf

Keywords: flashproxy added

comment:7 Changed 4 years ago by nickm

Keywords: 025-deferrable added
Status: new → needs_information

We really need easy instructions on how to reproduce these, or we probably won't be able to get a fix done for 0.2.5.

Changed 4 years ago by dcf

Attachment: oops.go added

pluggable transport that drops its first few connections

Changed 4 years ago by dcf

Attachment: oops-torrc added

comment:8 Changed 4 years ago by dcf

attachment:oops.go is a pluggable transport that reproduces the refusal to reconnect. It drops its first -n connections after -t seconds. After that it works as a pass-through dummy transport. attachment:oops-torrc is a configuration for it.
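
The drop behaviour described above can be modelled roughly as follows. This is not the code in oops.go, only a sketch of the decision it makes per connection, assuming each of the first -n connections is cut -t seconds after it is accepted (which matches the 3-second gaps in the log below). The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DropPolicy:
    drop_count: int    # mirrors the -n option: how many connections to drop
    drop_after: float  # mirrors the -t option, in seconds
    seen: int = 0      # number of connections accepted so far

    def on_connection(self) -> Tuple[int, Optional[float]]:
        """Return (connection index, seconds until it is dropped or None)."""
        index = self.seen
        self.seen += 1
        if index < self.drop_count:
            return index, self.drop_after
        # Past the first drop_count connections: act as a plain pass-through.
        return index, None
```

With DropPolicy(drop_count=10, drop_after=3.0), connections 0 through 9 are each scheduled to be dropped after 3 seconds, and connection 10 onward is passed through untouched.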

To build and run:

apt-get install golang (or http://golang.org/doc/install)
export GOPATH=$PWD/go
go get
go build
tor -f oops-torrc

The command line in oops-torrc is ./oops -t 3s -n 10 --log oops.log. When I run tor, I get this in oops.log:

2014/04/23 20:38:37 starting [./oops -t 3s -n 10 --log oops.log]
2014/04/23 20:38:39 got connection 0
2014/04/23 20:38:42 oops! connection 0
2014/04/23 20:38:49 got connection 1
2014/04/23 20:38:52 oops! connection 1

and this in the tor output:

Apr 23 20:38:52.000 [info] circuit_build_failed(): Our circuit died before the first hop with no connection
Apr 23 20:38:52.000 [info] entry_guard_register_connect_status(): Unable to connect to entry guard '3VXRyxz67OeRoqHn' (86FA348B038B6A04F2F50135BF84BB74EF63485B). Marking as unreachable.
Apr 23 20:38:53.000 [notice] Our directory information is no longer up-to-date enough to build circuits: We have no usable consensus.
Apr 23 20:39:00.000 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as guard
Apr 23 20:39:00.000 [info] smartlist_choose_node_by_bandwidth(): Empty routerlist passed in to old node selection for rule weight as guard
Apr 23 20:39:00.000 [info] should_delay_dir_fetches(): Delaying dir fetches (no running bridges known)
Apr 23 20:39:00.000 [info] compute_weighted_bandwidths(): Empty routerlist passed in to consensus weight node selection for rule weight as guard
Apr 23 20:39:00.000 [info] smartlist_choose_node_by_bandwidth(): Empty routerlist passed in to old node selection for rule weight as guard
Apr 23 20:39:00.000 [info] should_delay_dir_fetches(): Delaying dir fetches (no running bridges known)
...on and on...

You might have to play with the timeout and number of connections dropped. Try deleting datadir if it doesn't work the first time.

tor seems to hang indefinitely if a disconnection happens during the fetching of descriptors (above 50% bootstrapped). A HUP will get it to make some more progress. I can also stimulate it into trying again by making a new SOCKS connection (curl --socks5-hostname localhost:9099 http://www.example.com). I haven't figured out how to reproduce the issue where tor doesn't reconnect even though it gets new SOCKS connections, though I too have seen that before. It might have to do with the "Tried for 120 seconds to get a connection" error from #10993.

comment:9 Changed 4 years ago by nickm

Status: needs_information → new

comment:10 Changed 4 years ago by nickm

Yikes; given the information above, it looks like this is going to be one with a lot of variables to experiment with, especially if provoking the bug is nondeterministic. The current plan is to try to reproduce this a bit for 0.2.5, see if it turns out to be something simple to figure out and fix, and if not, defer until 0.2.6.

comment:11 Changed 4 years ago by nickm

Keywords: 024-backport removed

comment:12 Changed 4 years ago by arma

I think this is the same as #3259 ?

comment:13 in reply to:  12 Changed 4 years ago by yawning

Replying to arma:

I think this is the same as #3259 ?

From what I've seen, that's a plausible explanation.

From IRC discussion:

armadev | it's easy to trigger. just do any of the things that
        | causes tor to mark the relay as not running. then tor
        | won't try to connect to it.

armadev | a fix might be to mark all your bridges up if you have
        | bridges, they're all down, and you get a new stream
        | request

armadev | another fix might be to not mark your last bridge as
        | down unless you really mean it
armadev | i like that one more

armadev | but we also want to stop tor from thrashing if its
        | network is actually down (meaning all its bridges
        | really are unreachable)

armadev | i think we'd make some good progress if we
        | distinguished "there was a network error attempting to
        | establish the tcp connection" from "i gave up on the
        | circuit because it had been a while, but i did have a
        | tcp connection"
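
As a rough illustration of two of the ideas in the log above (not marking the last bridge down, and telling a TCP-level failure apart from a circuit timeout), here is a small sketch. It is not Tor's code; every name in it is hypothetical, and it does not address the thrashing concern when the network really is down.

```python
from enum import Enum


class Failure(Enum):
    TCP_CONNECT_ERROR = 1  # network error while establishing the TCP connection
    CIRCUIT_TIMEOUT = 2    # TCP connection existed, but the circuit was given up on


class BridgeSet:
    def __init__(self, names):
        # Every configured bridge starts out considered reachable.
        self.up = {name: True for name in names}

    def report_failure(self, name, failure):
        # A circuit timeout alone is not treated as evidence the bridge is down.
        if failure is not Failure.TCP_CONNECT_ERROR:
            return
        # Only mark this bridge down if at least one other bridge is still up,
        # so the last usable bridge is never marked down.
        if any(up for other, up in self.up.items() if other != name):
            self.up[name] = False

    def usable(self):
        return sorted(name for name, up in self.up.items() if up)
```

With two bridges, a TCP error on the first marks it down, but a later TCP error on the second leaves it up, so there is always something left to retry when a new stream arrives.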

#3259 is the non-pt case, but recent discussion has been centered around PTs, and arma notes that there is probably a lot of overlap.

For every PT apart from meek (and possibly flashproxy), checking conn->proxy_state != PROXY_OK would be sufficient to determine whether the upstream connection was established. A few PT implementations will also send back appropriate SOCKSv5 error codes (Host Unreachable/Network Unreachable), but the tor side currently uses that information only for logging.

meek loses out here because it blindly sends back a SOCKSv4 success response right after receiving the request.

Dunno. Maybe I'm overthinking things.

comment:14 Changed 3 years ago by nickm

Milestone: Tor: 0.2.5.x-final → Tor: 0.2.???

Putting this and #3259 into 0.2.???

comment:15 Changed 3 years ago by mikeperry

Yawning + others: I spent some time trying to track down whatever I'm experiencing and made some progress. When it happens to me, my control port tells me that all of the guards are live (via 'getinfo entry-guards'), but the info logs indicate that no guards are available for use in a circuit, and that an empty list is being passed into node_sl_choose_by_bandwidth() and subfunctions.

I created a branch with some additional info logs in mikeperry/bug11301-logs. That branch is off of maint-0.2.4 with all of the TBB/PT backport patches applied, as well as logging instrumentation in choose_random_entry_impl() (which is where the empty guard list is passed to node_sl_choose_by_bandwidth()). I'm still waiting for it to happen again. It seems to happen only intermittently (like once every few days), but when it does happen, it keeps happening until I completely restart Tor.

Figured I'd post my logging branch while I wait for another instance of the issue in case anyone else wants to join in on the hunt.

comment:16 Changed 3 years ago by isis

Cc: isis added

comment:17 Changed 3 years ago by arma

I have several times gotten the impression that 'getinfo entry-guards' is not telling people accurate information. I assume it was a case where we had a ticket open that was vague, somebody showed up to implement something, nickm decided that the something was not harmful to merge, and then the ticket got closed.

As an aside, is #14216 an adequate explanation for this ticket?

comment:18 Changed 3 years ago by isis

I got hit by this bug super badly today, using Tor version 0.2.5.10 (git-42b42605f8d8eac2); tor decided roughly every 30 seconds to 2 minutes that there weren't any "bridges" available.

FWIW, I've experienced this bug for over a year now. The most reliable way that I know of to trigger it is to use normal relays as bridges, although it does occasionally occur with real bridge relays. I remember vaguely looking into it and thinking that somewhere a list of available relays and another list of available bridges existed simultaneously, and then basically that the bridge list was either deduplicated or checked for normal relays, causing some or all of the "bridges" to end up back in the real relay list, and thus causing Tor to think that there weren't any of the configured bridges available.
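
To make that half-remembered hypothesis concrete, here is a toy model of it. It is purely illustrative and not taken from Tor's source: the function name is hypothetical and the fingerprints are made-up placeholders. It only shows how filtering the bridge list against the public relay list would leave nothing usable when normal relays are configured as bridges.

```python
def usable_bridges(configured_bridges, known_relays):
    """Hypothesised faulty filter: drop any 'bridge' that is also a known relay."""
    known = set(known_relays)
    return [fp for fp in configured_bridges if fp not in known]


# Placeholder fingerprints, not real relays.
relays = ["AAAA", "BBBB", "CCCC"]

# Normal relays configured as bridges: everything is filtered out.
print(usable_bridges(["AAAA", "BBBB"], relays))

# A real bridge that is absent from the relay list survives the filter.
print(usable_bridges(["DDDD"], relays))
```

Under this model the first call returns an empty list, which would match the "Empty routerlist passed in" log lines seen earlier in this ticket, while the second call keeps the bridge.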

comment:19 Changed 3 years ago by isis

Keywords: isis2015Q2 added

comment:20 in reply to:  18 Changed 3 years ago by arma

Replying to isis:

The most reliable way that I know of to trigger it is to use normal relays as bridges

That is an unsupported approach with known bugs. See e.g. #1776. I mean, feel free to do it, but when it breaks you get to keep the pieces. And I worry that people using this unsupported approach might be masking or distracting bugs in actual bridge use.

I remember vaguely looking into it and thinking that somewhere a list of available relays and another list of available bridges existed simultaneously, and then basically that the bridge list was either deduplicated or checked for normal relays, causing some or all of the "bridges" to end up back in the real relay list, and thus causing Tor to think that there weren't any of the configured bridges available.

Sounds like you're describing #1776.

comment:21 Changed 3 years ago by nickm

Milestone: Tor: 0.2.??? → Tor: 0.2.7.x-final

tbb-wants, so marking for consideration

comment:22 Changed 3 years ago by mikeperry

Keywords: tbb-wants added; tbb-needs removed

comment:23 Changed 3 years ago by mikeperry

Keywords: tbb-wants removed
Milestone: Tor: 0.2.7.x-final → Tor: unspecified

Actually, I think this bug may be specific to using normal tor nodes as bridges, which is not a typical behavior. I don't think this happens with normal bridges. I am removing the tbb-wants tag and 0.2.7 milestone for that reason.

comment:24 Changed 3 years ago by mikeperry

Summary: Tor does not reconnect after network loss with bridges → Tor does not reconnect after network loss with guards used as bridges

comment:25 in reply to:  24 Changed 3 years ago by isis

Replying to mikeperry:

Related: #1776

comment:26 Changed 6 months ago by nickm

Keywords: 025-triaged removed

remove an old triage keyword.

comment:27 Changed 6 months ago by nickm

Keywords: 025-deferrable isis2015Q2 removed
Severity: Normal

comment:28 Changed 6 months ago by nickm

Keywords: tor-bridges sponsor8-maybe added