Opened 4 years ago

Closed 4 years ago

#9229 closed defect (fixed)

While bootstrapping, Tor clients stall for 60s when obfsproxy bridges are used.

Reported by: phw Owned by: asn
Priority: Medium Milestone: Tor: 0.2.5.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: 60s, consensus, stall, obfsproxy, flashproxy, tbb-needs, 024-backport, 025-triaged, andrea-review-0254, tbb-tor-backported-3.6b1
Cc: asn, karsten, mcs, brade Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

When Tor bootstraps using obfsproxy, it stalls for 60 seconds. The last three log messages before that happening are:

[notice] Bootstrapped 50%: Loading relay descriptors.
[notice] new bridge descriptor 'bzoum' (fresh): $2ADFE7AA8D272C520D1FBFBF4E413F3A1B26313D~bzoum at [redacted]
[notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.

After 60 seconds, it continues to bootstrap to 100%. This was observed for all pluggable transports: flash proxies, obfs{2,3} and ScrambleSuit. Vanilla bridges do not seem to be affected by this. Also, the very first bootstrap (i.e., with an empty data directory) also seems to be unaffected by this.

I attached two files containing debug logs (after the second bootstrap) of obfs3 and vanilla bridges. I removed the timestamps to make it easy to compare both of them.

I tested this with Tor v0.2.4.15-rc and obfsproxy da31ffc638a2727df1679bc888c2a8c1c49809a9.

Child Tickets

Attachments (2)

debug-obfs3.log (38.1 KB) - added by phw 4 years ago.
debug-vanilla.log (689.3 KB) - added by phw 4 years ago.

Download all attachments as: .zip

Change History (36)

Changed 4 years ago by phw

Attachment: debug-obfs3.log added

Changed 4 years ago by phw

Attachment: debug-vanilla.log added

comment:1 Changed 4 years ago by karsten

Cc: karsten added

comment:2 Changed 4 years ago by asn

Hm, I couldn't find the root cause after a quick look.

Philipp noticed that in vanilla Tor, there are two calls to connection_tls_start_handshake() whereas in obfstor there is only one.

Also, what's up with

Jul 08 16:30:33.779 [Debug] conn_close_if_marked(): Cleaning up connection (fd -1).
Jul 08 16:30:33.780 [Debug] connection_remove(): removing socket -1 (type Directory), n_conns now 5
Jul 08 16:30:33.780 [Info] connection_free_(): Freeing linked Directory connection [client finished] with 0 bytes on inbuf, 0 on outbuf.

right before the idle time starts?

(socket 9 (that wants to read all the time) seems to be a control connection btw)

comment:3 Changed 4 years ago by nickm

The 60 seconds makes me wonder whether one of the things that we periodically check could be to blame -- LAZY_DESCRIPTOR_RETRY_INTERVAL, maybe?

comment:4 Changed 4 years ago by nickm

Component: ObfsproxyTor
Milestone: Tor: 0.2.5.x-final

Calling this a Tor issue.

comment:5 Changed 4 years ago by asn

Status: newneeds_information

Philipp, I cannot reproduce this in my workstation

Does it always happen to you? Can you remind me if it also happens when you have cached descriptors? Any other info?

comment:6 Changed 4 years ago by karsten

Looks like this problem also happens on normal bridges. See #9557 which I just closed as dup, because this is very likely the same problem.

comment:7 Changed 4 years ago by nickm

Does this problem go away if you take LAZY_DESCRIPTOR_RETRY_INTERVAL down to (say) 10 seconds?

How does one reproduce it? I just started a new tor with an empty datadir and a randomly chosen bridge, and it worked fine.

comment:8 in reply to:  7 Changed 4 years ago by karsten

Replying to nickm:

Does this problem go away if you take LAZY_DESCRIPTOR_RETRY_INTERVAL down to (say) 10 seconds?

Still 1 minute for me.

How does one reproduce it? I just started a new tor with an empty datadir and a randomly chosen bridge, and it worked fine.

Interesting. Can you retry after removing cached-* and leaving state in place?

comment:9 in reply to:  6 Changed 4 years ago by phw

Replying to karsten:

Looks like this problem also happens on normal bridges. See #9557 which I just closed as dup, because this is very likely the same problem.

FWIW, I seem to have the opposite problem. My tor (with obfsproxy) bootstraps fine only the first time. Subsequent bootstraps then lead to the 60-second-problem.

Also, I can't reproduce #9557. I just tested this with tor v0.2.4.16-rc (git-dcf6b6d7dda9ffbd) using:

tor --datadirectory /tmp/tmptor --bridge x.x.x.x:443 --controlport auto --socksport auto

This setup worked fine for me.

comment:10 in reply to:  5 Changed 4 years ago by phw

Replying to asn:

Philipp, I cannot reproduce this in my workstation

Does it always happen to you? Can you remind me if it also happens when you have cached descriptors? Any other info?

Sorry for taking so long to respond.

Yes, it seems to always happen.

I just tested it again with the newest PTTBB. When starting it for the very first time (i.e., with an empty data directory), it bootstraps as expected. Subsequent times stall for 60 seconds. My system is a 64-bit Debian wheezy. The PTTBB is unconfigured except that I disabled flash proxies but I remember dcf mentioning that he experienced this with flash proxies as well. And yes, this also happens with cached descriptors, i.e., the file cached-descriptors.new being non-empty.

comment:11 in reply to:  7 Changed 4 years ago by phw

Replying to nickm:

How does one reproduce it? I just started a new tor with an empty datadir and a randomly chosen bridge, and it worked fine.

Please see my previous comment. First, I only have problems in combination with obfsproxy. In contrast to what karsten is experiencing, vanilla bridges work fine for me. Second, emtpy datadirs also work fine. The problem only occurs with non-empty datadirs.

comment:12 Changed 4 years ago by karsten

Huh, I wonder why nobody can reproduce the problem I'm seeing. Describing it in more detail. Here's my torrc (note the UseBridges 1 part which phw doesn't seem to use, and maybe rightly so):

DataDirectory /home/ubuntu/src/client
SocksPort 9000
UseBridges 1
Bridge [scrubbed]:443
Log debug file /home/ubuntu/src/client/debug.log
Log info file /home/ubuntu/src/client/info.log
Log notice file /home/ubuntu/src/client/notice.log
LogTimeGranularity 1 msec
SafeLogging 0

[scrubbed] is a bridge I just got from bridges.tpo. I start tor (master, 7121e7b) using:

/home/ubuntu/src/tor-master/src/or/tor -f /home/ubuntu/src/client/torrc

Bootstraps to 100% just fine. Then I rm cached-* (but not state) and re-run:

/home/ubuntu/src/tor-master/src/or/tor -f /home/ubuntu/src/client/torrc

Takes a minute to download the microdesc consensus.

phw, can you try again using a setup closer to mine?

comment:13 in reply to:  12 Changed 4 years ago by phw

Replying to karsten:

Huh, I wonder why nobody can reproduce the problem I'm seeing. Describing it in more detail.

Alright, I can now reproduce; even with v0.2.4.16-rc (git-dcf6b6d7dda9ffbd). I missed that you only deleted cached-* files. Here's the log I get with your config. One can see the 60-second-wait.

Aug 22 11:40:46.442 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Aug 22 11:40:46.486 [notice] We were built to run on a 64-bit CPU, with OpenSSL 1.0.1 or later, but with a version of OpenSSL that apparently lacks accelerated support for the NIST P-224 and P-256 groups. Building openssl with such support (using the enable-ec_nistp_64_gcc_128 option when configuring it) would make ECDH much faster.
Aug 22 11:40:47.506 [notice] Bootstrapped 5%: Connecting to directory server.
Aug 22 11:40:47.639 [notice] Bootstrapped 10%: Finishing handshake with directory server.
Aug 22 11:40:48.048 [notice] Learned fingerprint [scrubbed].
Aug 22 11:40:48.048 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.
Aug 22 11:40:48.353 [notice] Bootstrapped 20%: Asking for networkstatus consensus.
Aug 22 11:40:48.487 [notice] Bootstrapped 50%: Loading relay descriptors.
Aug 22 11:40:48.755 [notice] new bridge descriptor [scrubbed].
Aug 22 11:40:48.755 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Aug 22 11:41:50.359 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Aug 22 11:41:50.891 [notice] I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 0/4314, and can only build 0% of likely paths. (We have 0% of guards bw, 0% of midpoint bw, and 0% of exit bw.)
Aug 22 11:41:52.506 [notice] We now have enough directory information to build circuits.
Aug 22 11:41:52.506 [notice] Bootstrapped 80%: Connecting to the Tor network.
Aug 22 11:41:52.511 [notice] Bootstrapped 90%: Establishing a Tor circuit.
Aug 22 11:41:53.921 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Aug 22 11:41:53.921 [notice] Bootstrapped 100%: Done.

comment:14 Changed 4 years ago by nickm

Interesting. I can now reproduce that, given the "remove cached-* but not state" instruction.

If I remove the cached-* files and also remove the EntryGuard* entries from the state file, that *is* sufficient to make bootstrapping fast.

comment:15 Changed 4 years ago by nickm

Status: needs_informationnew

Changing networkstatus_dl_check_interval() in main.c to a lower value seems to resolve this for me. (Please confirm?) Before we do that in production, though, we need to check whether there are any negative effects (e.g., runaway download attempts on failure) from doing so.

comment:16 in reply to:  15 Changed 4 years ago by phw

Replying to nickm:

Changing networkstatus_dl_check_interval() in main.c to a lower value seems to resolve this for me. (Please confirm?) Before we do that in production, though, we need to check whether there are any negative effects (e.g., runaway download attempts on failure) from doing so.

It seems to work for me. I lowered the value to 20 and could reduce the waiting time in both cases (karsten's vanilla bridges without cached-* files and my obfsproxy scenario) to 20 seconds.

comment:17 Changed 4 years ago by nickm

So, reducing networkstatus_dl_check_interval() in the mainline code still makes me nervous. I see two possible ways forward here:

  1. We could try very hard to make sure that there isn't a runaway-downloads bug that we'd be exacerbating by doing the check more frequently.
  2. We could figure out what event (in this case) we should be triggering a download check in response to, and trigger a download check in response to that.

comment:18 Changed 4 years ago by dcf

Keywords: flashproxy added

comment:19 Changed 4 years ago by brade

Cc: mcs brade added

comment:20 Changed 4 years ago by mikeperry

Keywords: tbb-needs added

comment:21 in reply to:  17 Changed 4 years ago by nickm

Replying to nickm:

So, reducing networkstatus_dl_check_interval() in the mainline code still makes me nervous. I see two possible ways forward here:

...

  1. We could figure out what event (in this case) we should be triggering a download check in response to, and trigger a download check in response to that.

This is the safer approach.

I tried provoking the bug again, but adding a log message every time we are about to call UPDATE_NETWORKSTATUS_DOWNLOADS from main.c. I saw:

Mar 07 15:21:49.000 [notice] Bootstrapped 5%: Connecting to directory server.
Mar 07 15:21:49.000 [notice] UPDATE_NETWORKSTATUS_DOWNLOADS
Mar 07 15:21:49.000 [notice] Bootstrapped 10%: Finishing handshake with directory server.
Mar 07 15:21:49.000 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.
Mar 07 15:21:49.000 [notice] Bootstrapped 20%: Asking for networkstatus consensus.
Mar 07 15:21:50.000 [notice] new bridge descriptor 'Unnamed' (fresh): $38FC956E026A600F4F66EA559D9CAAC4365E2C93~Unnamed at 192.36.31.159
Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Mar 07 15:21:50.000 [notice] new bridge descriptor 'braeu5' (fresh): $7683C2EA0896D5C3859CD594D691EE09831C1E8A~braeu5 at 90.225.64.7
Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Mar 07 15:21:50.000 [notice] new bridge descriptor 'Unnamed' (fresh): $476136C72FF5C937E97E95A7CFA5A44424237B5F~Unnamed at 188.165.251.124
Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.

 (long delay here)

Mar 07 15:22:50.000 [notice] UPDATE_NETWORKSTATUS_DOWNLOADS
Mar 07 15:22:50.000 [notice] Bootstrapped 25%: Loading networkstatus consensus.
Mar 07 15:22:53.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Mar 07 15:22:53.000 [notice] Bootstrapped 40%: Loading authority key certs.

So there's a clue that we should be calling update_networkstatus_downloads() in response to getting new bridge descriptors if we don't have a fresh networkstatus.

comment:22 Changed 4 years ago by nickm

Broken thing 1:

In learned_bridge_descriptor(), when I reproduce the bug, the body of this if-statement is never executed:

      if (first) {
        routerlist_retry_directory_downloads(now);
      }

Interesting! Apparently, by the time we hit learned_bridge_descriptor(), any_bridge_descriptors_known() is always true. At least, for me.

The function routerlist_retry_directory_downloads() seems pretty heavy-weight for the general case we're concerned about here too. In my testing, update_networkstatus_downloads() was sufficient. Writing a patch...

comment:23 Changed 4 years ago by nickm

Keywords: 024-backport added
Status: newneeds_review

Branch "bug9229_024" in my public repository has an alleged fix suitable for merge into 0.2.4 IMO, assuming it tests out okay. (It works for me.)

A more thorough fix might figure out why any_bridge_descriptor_known() is returning true in this case, and figure out a better way to fix that. I have a fix like that in "bug9229_025".

Please review and test?

comment:24 Changed 4 years ago by nickm

Keywords: 025-triaged added

comment:25 in reply to:  23 Changed 4 years ago by arma

Replying to nickm:

Branch "bug9229_024" in my public repository has an alleged fix suitable for merge into 0.2.4 IMO, assuming it tests out okay. (It works for me.)

Are we sure we don't want an update_all_descriptor_downloads() call too?

I'm thinking of a case where we have an adequate cached consensus, and fetching the bridge descriptor should trigger us to fetch our microdescriptors now.

(That said, I think your fix is an improvement over the current 0.2.4 situation, so it depends how far we want to go.)

comment:26 Changed 4 years ago by asn

FWIW, I have tested bug9229_025 in two boxes so far, and it worked fine.

comment:27 Changed 4 years ago by nickm

Status: needs_reviewneeds_revision
13:23 < asn> nickm: i still don't understand why any_bridge_descriptors_known() 
             doesn't do the trick in bug9229_025 
13:24 < asn> but I think that the patch can't make the situation worse
13:24 < nickm> hm. num_bridges_usable in that branch is totally misnamed
13:26 < nickm> also n_options_out is misimplemented and useless

comment:28 Changed 4 years ago by nickm

Status: needs_revisionneeds_review

So, I fixed two major bugs in the patch with 6ab10a546663b509d189ce16645b11af0e83d41c.

The reason not to use any_bridge_descriptor_known() is that, by the time we were checking it, we had _one_ good bridge descriptor, and so it was returning false, and so we couldn't actually notice that we had just become live.

Previous testing should probably be considered useless; can anybody test it now?

comment:29 Changed 4 years ago by nickm

Keywords: andrea-review-0254 added

Drop owners from needs_review tickets in tor 0.2.5

comment:30 Changed 4 years ago by andrea

The bug9229_024 branch looks fine to me; now looking at bug9229_025.

comment:31 Changed 4 years ago by andrea

Code review for nickm/bug9229_025 branch:

  • 7aa21920481fc4f0bdb159aeedf11582aa95dd01:
    • Seems less than intuitive to find a function called choose_random_entry_impl() which actually either does that or counts the number of entries. Is this just because that was the only place we happened to already have code to construct the list of bridges?
    • The stuff you fixed in 6ab10a546663b509d189ce16645b11af0e83d41c
  • 6ab10a546663b509d189ce16645b11af0e83d41c
    • Looks fine, good idea to make things work :)

comment:32 Changed 4 years ago by mikeperry

Keywords: tbb-tor-backported-3.6b1 added

nickm/bug9229_024 was backported to TBB 3.6beta1 (https://gitweb.torproject.org/builders/tor-browser-bundle.git/blob/HEAD:/gitian/patches/bug9229.patch). Right now, we are not intending on backporting the 0.2.5.x version, unless someone suggests otherwise.

comment:33 in reply to:  31 Changed 4 years ago by nickm

Replying to andrea:

Code review for nickm/bug9229_025 branch:

  • 7aa21920481fc4f0bdb159aeedf11582aa95dd01:
    • Seems less than intuitive to find a function called choose_random_entry_impl() which actually either does that or counts the number of entries. Is this just because that was the only place we happened to already have code to construct the list of bridges?

Right; also, because having the code to construct the list in two places was would be begging for them to fall out of sync.

Merging the 0.2.5 one to 0.2.5 and the 0.2.4 one to 0.2.4.

comment:34 Changed 4 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed
Note: See TracTickets for help on using tickets.