Opened 5 months ago

Closed 2 months ago

Last modified 4 weeks ago

#25600 closed defect (fixed)

Tor eventually hits "No running bridges" with Snowflake

Reported by: cypherpunks Owned by:
Priority: Medium Milestone:
Component: Obfuscation/Snowflake Version:
Severity: Normal Keywords:
Cc: dcf, arlolra Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Steps to reproduce:

  1. Use the fresh test builds by dcf: https://people.torproject.org/~dcf/pt-bundle/snowflake/20180321-8.0a4-4a5889af2891/
  2. Start a download of https://mirrors.edge.kernel.org/centos/7.4.1708/isos/x86_64/CentOS-7-x86_64-Everything-1708.iso and watch about:downloads or your Task Manager.
  3. Open another instance of snowflake-client (for example, in Ricochet, edit the torrc in ./local/share/Ricochet/ricochet/tor with the relevant config; a sketch of such a torrc follows below).
  4. Watch the download fail, or the receive rate in the Task Manager drop to 0.

Doesn't seem to be reproducible with the earlier version of snowflake-client.
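
For reference, step 3 boils down to pointing a second tor at its own copy of snowflake-client. A hypothetical minimal torrc for that second instance might look like the sketch below; the paths, ports, and broker/ICE options are placeholders or omitted, and only the bridge fingerprint and dummy address (the 'flakey' bridge that shows up later in this ticket's logs) come from the bundle.

# Hedged sketch of a torrc for a second snowflake instance; adjust paths
# and ports, and copy the full ClientTransportPlugin options and Bridge
# line from the bundle's own torrc.
UseBridges 1
ClientTransportPlugin snowflake exec /usr/bin/snowflake-client
Bridge snowflake 0.0.3.0:1 2B280B23E1107BB62ABFC40DDCC8824814F80A72
SOCKSPort 9250
DataDirectory ./local/share/Ricochet/ricochet/tor
Log notice stdout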

Child Tickets

Attachments (2)

newer-snowflake-client-task-manager-test.png (13.2 KB) - added by cypherpunks 5 months ago.
0001-Log-to-io.Discard-if-no-log-file-is-set.patch (1.0 KB) - added by dcf 2 months ago.


Change History (23)

comment:1 Changed 5 months ago by cypherpunks

Some more details: This was tested on Debian. Also the snowflake-client used for Ricochet is a copy placed in /usr/bin.

comment:2 Changed 5 months ago by dcf

Thanks for this.

I tried it, but I wasn't able to reproduce it. On Debian, I used a browser from https://people.torproject.org/~dcf/pt-bundle/snowflake/20180321-8.0a4-4a5889af2891/ and a command line snowflake-client from 07291a0136.

  1. Started a download of https://mirrors.edge.kernel.org/centos/7.4.1708/isos/x86_64/CentOS-7-x86_64-Everything-1708.iso. Was getting about 400 KB/s.
  2. In the snowflake/client directory, ran tor -f torrc SOCKSPort 10000. It finished bootstrapping and the Browser download was still running.
  3. Download through the command-line client: curl -O --proxy socks5h://127.0.0.1:10000/ https://mirrors.edge.kernel.org/centos/7.4.1708/isos/x86_64/CentOS-7-x86_64-Everything-1708.iso. I was getting about 500 KB/s on this download and the browser was still about 400 KB/s.

Maybe it was a coincidence that one of the clients died when the other one started? I'm currently testing some proxy changes from comment:63:ticket:21312, and it seems there is still a problem with proxies hanging up occasionally. Does it happen for you repeatedly?

Or, there could be some weird interaction, like maybe it only happens when both of the clients are using the same proxy instance.

comment:3 in reply to:  2 Changed 5 months ago by cypherpunks

Resolution: not a bug
Status: new → closed

Replying to dcf:

Thanks for this.

I tried it, but I wasn't able to reproduce it. On Debian, I used a browser from https://people.torproject.org/~dcf/pt-bundle/snowflake/20180321-8.0a4-4a5889af2891/ and a command line snowflake-client from 07291a0136.

  1. Started a download of https://mirrors.edge.kernel.org/centos/7.4.1708/isos/x86_64/CentOS-7-x86_64-Everything-1708.iso. Was getting about 400 KB/s.
  2. In the snowflake/client directory, ran tor -f torrc SOCKSPort 10000. It finished bootstrapping and the Browser download was still running.
  3. Download through the command-line client: curl -O --proxy socks5h://127.0.0.1:10000/ https://mirrors.edge.kernel.org/centos/7.4.1708/isos/x86_64/CentOS-7-x86_64-Everything-1708.iso. I was getting about 500 KB/s on this download and the browser was still about 400 KB/s.

Maybe it was a coincidence that one of the clients died when the other one started? I'm currently testing some proxy changes from comment:63:ticket:21312, and it seems there is still a problem with proxies hanging up occasionally. Does it happen for you repeatedly?

Interesting. I had actually repeated the steps twice at different times (I left the download going for quite a while at one point, probably 7 min) and I always had the same result. I was experiencing repeated stalls, but I thought that might have been due to the other snowflake-client process I had running. But the good news is that I can't reproduce it now, which means you're right :)

FWIW the hangs still happen for me from time to time.

Last edited 5 months ago by cypherpunks

comment:4 Changed 5 months ago by cypherpunks

I'm definitely noticing that after dozens of minutes of inactivity tor eventually hits:

Delaying directory fetches: No running bridges
Application request when we haven't received a consensus with exits. Optimistically trying known bridges again.

Restarting the affected program solves the issue. But I'm not sure if this is a consequence of the proxy changes that you were mentioning.

comment:5 in reply to:  4 ; Changed 5 months ago by dcf

Replying to cypherpunks:

I'm definitely noticing that after dozens of minutes of inactivity tor eventually hits:

Delaying directory fetches: No running bridges
Application request when we haven't received a consensus with exits. Optimistically trying known bridges again.

Restarting the affected program solves the issue. But I'm not sure if this is a consequence of the proxy changes that you were mentioning.

I've seen this too, not only with snowflake. I suspect it's actually a tor bug. Sometimes a loading tab in Tor Browser will stall, and I'll see this message. Then, I open another tab and go to a domain that doesn't have an existing circuit, and it will "un-stall" the first tab. It's like tor sometimes starts believing that none of its bridges are working, and refuses to even try again, until you ask for a new circuit.

It may happen more often with snowflake because of #25429 (disconnects after 30 s). Some other tor-related tickets are #20379 (fixed in November 2016) and #24392 (fixed in November 2018). I'm not sure if #24392 is fixed in a released Tor Browser yet.

comment:6 in reply to:  5 Changed 5 months ago by cypherpunks

Replying to dcf:

I've seen this too, not only with snowflake. I suspect it's actually a tor bug. Sometimes a loading tab in Tor Browser will stall, and I'll see this message. Then, I open another tab and go to a domain that doesn't have an existing circuit, and it will "un-stall" the first tab. It's like tor sometimes starts believing that none of its bridges are working, and refuses to even try again, until you ask for a new circuit.

It may happen more often with snowflake because of #25429 (disconnects after 30 s). Some other tor-related tickets are #20379 (fixed in November 2016) and #24392 (fixed in November 2018). I'm not sure if #24392 is fixed in a released Tor Browser yet.

So I worked out a methodology to determine for sure whether or not it's due to the switch to the newer snowflake-client (if it isn't, it would primarily be an issue with tor; FWIW I'm using 0.3.3.x-alpha in all of this).

  1. First, I switched everything back to the older snowflake-client (sha256sum ff94f6ed041263eb4e3bc8c215d9469a3a8cca7957ba4bccea617dc220294a74), including the one in my /usr/bin.
  2. I restarted the two instances of tor that I had running, and I opened the Tor Browser and went to two sites then clicked on New Identity.
  3. I refrained from any activity for 40 minutes.
  4. I went back and first looked at the task manager: the receiving/sending rates were varying around 0.9-4 KiB/s and 2-5 KiB/s with no dips, and could go higher than that when snowflake rotates to another snowflake proxy (I don't have any other significant application running that could affect the network activity). Then I opened check.torproject.org in the Tor Browser; it took a bit of time but it eventually worked. I ran apt-get update (I'm torifying my apt-gets, so it uses the debian-tor service, which was set up to use the older snowflake-client) and it worked as well.

Afterwards I switched everything back to the newer snowflake-client (sha256sum 44ab0a493573b4f95a3b01f7495676cca6cd7ea2ae2afe1b69c4591ae4dbba2a), and repeated step (2) above.

  1. I refrained from any activity for 30 minutes.
  2. Looking at the task manager, one clearly sees that everything stopped working. Looking at Wireshark, one mostly sees STUN packets with Binding Request and Binding Success Response XOR-MAPPED-ADDRESS: ... Opening the console of the Tor Browser, one sees Delaying directory fetches: No running bridges, and opening check.torproject.org eventually times out. Going to another site yields another timeout. The same goes for running apt-get update, and one sees Delaying directory fetches: No running bridges in /var/log/syslog as well.

I hope you can reproduce this so that you can then narrow the problem down to a specific commit. (Note that in my testing I had 3 instances of tor running in total, all of them with snowflake; the issue may be present with just one running, it's just that I didn't have time to test. Edit: I tested that now, and it does happen with only one instance running.)

Edit: By the way, this is reproducible with the latest snowflake-client (sha256sum fa2b6d0581df756f700c53d79ea40a141f6a96962b8de52c65d6817e6456af15) from TB 8.0a5.

#25347 might also be related.

Last edited 5 months ago by cypherpunks

Changed 5 months ago by cypherpunks

comment:7 Changed 5 months ago by cypherpunks

Since #25347 might be related, I'll wait until it's fixed and there's a new TB alpha release with it to see if it solves the issue.

Edit: I built my own tor with asn's patch in #25347#comment:15 (./autogen.sh && ./configure --disable-asciidoc && make, sha256sum 197af294a5abd93fcf9ed9f58f9093f1dfee0a6413bf445f478af306255a0091) but, alas, the problem still exists.

Edit: Reproducible with tor nightly 0.3.4.0-alpha-dev-20180404T175310Z.

Last edited 4 months ago by cypherpunks

comment:8 Changed 4 months ago by cypherpunks

Resolution: not a bug
Status: closed → reopened

comment:9 Changed 4 months ago by cypherpunks

Update: Unfortunately it still happens (I used the new domains that dcf made in #22782 to get it working):

Apr 21 08:39:23 debian Tor[1274]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Apr 21 08:39:23 debian Tor[1274]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for rendezvous desc)
Apr 21 08:39:23 debian Tor[1274]: Tried for 120 seconds to get a connection to [scrubbed]:80. Giving up. (waiting for circuit)
Apr 21 08:39:23 debian Tor[1274]: Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
Apr 21 08:39:25 debian Tor[1274]: Delaying directory fetches: No running bridges

Another new thing: I see a "Tor has not observed any network activity" notice when this happens (it doesn't always appear).

05:20.000 [notice] Tor has not observed any network activity for the past 74 seconds. Disabling circuit build timeout recording.
05:21.000 [notice] Delaying directory fetches: No running bridges
05:39.000 [notice] Application request when we haven't received a consensus with exits. Optimistically trying known bridges again.
Last edited 3 months ago by cypherpunks

comment:10 Changed 3 months ago by cypherpunks

Summary: Running 2 instances of snowflake-client leads to the former one stopping → Tor eventually hits "No running bridges" with Snowflake

comment:11 Changed 3 months ago by cypherpunks

I was able to record detailed debug logs of tor when this happens:

https://privatebin.net/?55627cf4b879756c#E5s4afjMzPUnZ7tMMdWtvyKGdfdYbMF+jxyMPcyQtbw=

(I sent dcf the password from a throwaway guerrillamail since I don't have time to redact all the sensitive info from the logs)

comment:12 in reply to:  11 ; Changed 3 months ago by dcf

Replying to cypherpunks:

I was able to record detailed debug logs of tor when this happens,

Thanks, I got it.

I have encountered this "No running bridges" problem too. I wonder if it happens when a snowflake proxy comes online but is behind some kind of NAT, or something similar, such that it can't actually do WebRTC. One way to debug this would be to activate the snowflake log (-log snowflake.log) and see which proxy it was connected to last before "No running bridges". 37.218.242.151 is the fallback proxy; anything else is a wild proxy.
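
For example, logging can be turned on by adding -log to the client's ClientTransportPlugin line in torrc (a sketch; the path to snowflake-client and any other options are whatever your bundle already uses):

ClientTransportPlugin snowflake exec /usr/bin/snowflake-client -log snowflake.log

The last proxy address that appears in snowflake.log before tor reports "No running bridges" then tells you whether the client was on the fallback proxy or a wild one.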

comment:13 in reply to:  12 ; Changed 3 months ago by cypherpunks

Replying to dcf:

I have encountered this "No running bridges" problem too. I wonder if it happens when a snowflake proxy comes online but is behind some kind of NAT, or something similar, such that it can't actually do WebRTC.

I don't think that's the case since I almost always hit the fallback proxy (the 37... one) and this problem always happens when I use snowflake.

But I think I finally nailed it: When I followed your suggestion to use -log snowflake.log I no longer had that problem (it's been a couple of hours now of testing, hopefully there will be no other surprise). So I think there's definitely some issue in the patch for #25471. This also confirms my results in comment:6 where I didn't have this problem with the earlier build of snowflake.

comment:14 in reply to:  13 ; Changed 3 months ago by dcf

Status: reopened → needs_review

Replying to cypherpunks:

Replying to dcf:

I have encountered this "No running bridges" problem too. I wonder if it happens when a snowflake proxy comes online but is behind some kind of NAT, or something similar, such that it can't actually do WebRTC.

I don't think that's the case since I almost always hit the fallback proxy (the 37... one) and this problem always happens when I use snowflake.

But I think I finally nailed it: When I followed your suggestion to use -log snowflake.log I no longer had that problem (it's been a couple of hours now of testing, hopefully there will be no other surprise). So I think there's definitely some issue in the patch for #25471. This also confirms my results in comment:6 where I didn't have this problem with the earlier build of snowflake.

My guess: tor is not reading from the client transport plugin's stderr pipe. In the absence of the -log option, snowflake-client writes log messages to stderr. Eventually the pipe buffer fills up, and snowflake-client hangs while writing a log message.

The reason it didn't happen in earlier versions of snowflake-client is that earlier versions always logged to a file by default. 12922a232b made it log to stderr by default and added a -log option. The change would have happened in Tor Browser 8.0a5, because that's when #21312 caused an upgrade to the version of snowflake-client that doesn't log by default.

Why this affects snowflake and not other PTs: probably just because snowflake-client writes more to the log (because of #25429 and other causes). Other PTs are probably affected; they just don't fill up the pipe buffer as fast.
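
To make the pipe-buffer theory concrete, here is a small standalone Go program (purely illustrative, not snowflake code) that writes into a pipe nobody reads, standing in for a plugin logging to an stderr that tor ignores. On Linux it typically prints around 64 lines (the default pipe buffer is 64 KiB) and then stalls forever on the next write:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Create a pipe and deliberately never read from it, like a parent
	// process that ignores its child's stderr.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()

	chunk := strings.Repeat("x", 1024) // 1 KiB per "log message"
	for i := 0; ; i++ {
		// Once the kernel pipe buffer is full, this Write blocks forever,
		// just as an unread log.Println to stderr would.
		if _, err := w.Write([]byte(chunk)); err != nil {
			break
		}
		fmt.Printf("wrote %d KiB\n", i+1)
	}
}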

Could you try this patch for snowflake-client?

--- a/client/snowflake.go
+++ b/client/snowflake.go
@@ -150,6 +150,8 @@ func main() {
                }
                defer logFile.Close()
                log.SetOutput(logFile)
+       } else {
+               log.SetOutput(ioutil.Discard)
        }
 
        log.Println("\n\n\n --- Starting Snowflake Client ---")
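
For context, a paraphrased, self-contained sketch of what the patched logging setup amounts to is below (the -log flag name matches the real client; the file-open flags and surrounding structure are assumptions, not the exact upstream code): if no log file is named, log output is discarded instead of going to stderr, so a full, unread pipe can never block the client.

package main

import (
	"flag"
	"io/ioutil"
	"log"
	"os"
)

func main() {
	logFilename := flag.String("log", "", "name of log file")
	flag.Parse()

	if *logFilename != "" {
		logFile, err := os.OpenFile(*logFilename,
			os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0600)
		if err != nil {
			log.Fatal(err)
		}
		defer logFile.Close()
		log.SetOutput(logFile)
	} else {
		// Default is stderr; discard instead so writes can never block.
		log.SetOutput(ioutil.Discard)
	}

	log.Println("\n\n\n --- Starting Snowflake Client ---")
}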

comment:15 in reply to:  14 Changed 3 months ago by cypherpunks

Replying to dcf:

Could you try this patch for snowflake-client?

I can't for now but I'll definitely inform you when I test the next alpha release when this gets merged :) (By the way many thanks for investigating the issue!)

Last edited 3 months ago by cypherpunks

comment:16 Changed 2 months ago by arlolra

Status: needs_review → merge_ready

The hunch and patch seem reasonable to me.

comment:17 Changed 2 months ago by dcf

Resolution: fixed
Status: merge_ready → closed

I tested the patch myself, copying the patched snowflake-client into a Tor Browser 8.0a8.

The first time I tried, I got "No running bridges" after 11 minutes :/

06/11/18 06:07:56.600 [NOTICE] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections. 
06/11/18 06:07:56.600 [NOTICE] Opening Socks listener on 127.0.0.1:9150 
06/11/18 06:07:57.500 [NOTICE] Bootstrapped 80%: Connecting to the Tor network 
06/11/18 06:07:57.500 [NOTICE] Bootstrapped 85%: Finishing handshake with first hop 
06/11/18 06:07:59.700 [NOTICE] Bootstrapped 90%: Establishing a Tor circuit 
06/11/18 06:08:00.600 [NOTICE] new bridge descriptor 'flakey' (fresh): $2B280B23E1107BB62ABFC40DDCC8824814F80A72~flakey at 0.0.3.0 
06/11/18 06:08:01.200 [WARN] Your Guard $2B280B23E1107BB62ABFC40DDCC8824814F80A72 ($2B280B23E1107BB62ABFC40DDCC8824814F80A72) is failing an extremely large amount of circuits. This could indicate a route manipulation attack, extreme network overload, or a bug. Success counts are 56/251. Use counts are 44/44. 247 circuits completed, 0 were unusable, 191 collapsed, and 2 timed out. For reference, your timeout cutoff is 60 seconds. 
06/11/18 06:08:01.800 [NOTICE] Tor has successfully opened a circuit. Looks like client functionality is working. 
06/11/18 06:08:01.800 [NOTICE] Bootstrapped 100%: Done 
06/11/18 06:08:03.900 [NOTICE] New control connection opened from 127.0.0.1. 
06/11/18 06:08:04.100 [NOTICE] New control connection opened from 127.0.0.1. 
06/11/18 06:18:37.900 [WARN] Your Guard $2B280B23E1107BB62ABFC40DDCC8824814F80A72 ($2B280B23E1107BB62ABFC40DDCC8824814F80A72) is failing a very large amount of circuits. Most likely this means the Tor network is overloaded, but it could also mean an attack against you or potentially the guard itself. Success counts are 87/288. Use counts are 48/48. 284 circuits completed, 0 were unusable, 197 collapsed, and 2 timed out. For reference, your timeout cutoff is 60 seconds. 
06/11/18 06:18:59.400 [NOTICE] Delaying directory fetches: No running bridges 
06/11/18 06:18:59.400 [NOTICE] Application request when we haven't received a consensus with exits. Optimistically trying known bridges again. 
06/11/18 06:24:02.100 [NOTICE] Delaying directory fetches: No running bridges 
06/11/18 06:24:22.200 [NOTICE] Application request when we haven't received a consensus with exits. Optimistically trying known bridges again. 

But I tried again, left it running overnight, restarted the next day, and didn't have another problem. So I'm going to go ahead with the theory that the first failure was a fluke or was caused by some other bug.

Anyway, I merged the patch in 6077141f4affdab9b7ce97a9b1c6859825eaaa29.

comment:18 Changed 2 months ago by dcf

#26360 is for tor not reading from transport plugins' stderr.

comment:19 in reply to:  17 Changed 2 months ago by cypherpunks

Replying to dcf:

The first time I tried, I got "No running bridges" after 11 minutes :/

There's still an unfixed little-t-tor bug that leads to that, so that may be it.

comment:20 Changed 7 weeks ago by cypherpunks

Will there be a "Bump snowflake/go-webrtc again for trac 25600" ticket for getting this fix into TB?

comment:21 in reply to:  20 Changed 4 weeks ago by dcf

Replying to cypherpunks:

Will there be a "Bump snowflake/go-webrtc again for trac 25600" ticket for getting this fix into TB?

#26795
