While bootstrapping, Tor clients stall for 60s when obfsproxy bridges are used.

changed milestone to %Tor: 0.2.5.x-final

added 024-backport 025-triaged 60s andrea-review-0254 component::core tor/tor consensus flashproxy milestone::Tor: 0.2.5.x-final obfsproxy owner::asn priority::medium resolution::fixed stall status::closed tbb-needs tbb-tor-backported-3.6b1 type::defect labels

Trac:
debug-obfs3.log

Trac:
debug-vanilla.log

Trac:
Cc: asn to asn, karsten

Hm, I couldn't find the root cause after a quick look.

Philipp noticed that in vanilla Tor, there are two calls to connection_tls_start_handshake() whereas in obfstor there is only one.

Also, what's up with

Jul 08 16:30:33.779 [Debug] conn_close_if_marked(): Cleaning up connection (fd -1).
Jul 08 16:30:33.780 [Debug] connection_remove(): removing socket -1 (type Directory), n_conns now 5
Jul 08 16:30:33.780 [Info] connection_free_(): Freeing linked Directory connection [client finished] with 0 bytes on inbuf, 0 on outbuf.

right before the idle time starts?

(socket 9 (that wants to read all the time) seems to be a control connection btw)

The 60 seconds makes me wonder whether one of the things that we periodically check could be to blame -- LAZY_DESCRIPTOR_RETRY_INTERVAL, maybe?

Calling this a Tor issue.

Trac:
Component: Obfsproxy to Tor
Milestone: N/A to Tor: 0.2.5.x-final

Philipp, I cannot reproduce this in my workstation

Does it always happen to you? Can you remind me if it also happens when you have cached descriptors? Any other info?

Trac:
Status: new to needs_information

Looks like this problem also happens on normal bridges. See #9557 (moved) which I just closed as dup, because this is very likely the same problem.

Does this problem go away if you take LAZY_DESCRIPTOR_RETRY_INTERVAL down to (say) 10 seconds?

How does one reproduce it? I just started a new tor with an empty datadir and a randomly chosen bridge, and it worked fine.

Replying to nickm:

Does this problem go away if you take LAZY_DESCRIPTOR_RETRY_INTERVAL down to (say) 10 seconds?

Still 1 minute for me.

How does one reproduce it? I just started a new tor with an empty datadir and a randomly chosen bridge, and it worked fine.

Interesting. Can you retry after removing cached-* and leaving state in place?

Replying to karsten:

Looks like this problem also happens on normal bridges. See #9557 (moved) which I just closed as dup, because this is very likely the same problem.

FWIW, I seem to have the opposite problem. My tor (with obfsproxy) bootstraps fine only the first time. Subsequent bootstraps then lead to the 60-second-problem.

Also, I can't reproduce #9557 (moved). I just tested this with tor v0.2.4.16-rc (git-dcf6b6d7dda9ffbd) using:

tor --datadirectory /tmp/tmptor --bridge x.x.x.x:443 --controlport auto --socksport auto

This setup worked fine for me.

Replying to asn:

Philipp, I cannot reproduce this in my workstation

Does it always happen to you? Can you remind me if it also happens when you have cached descriptors? Any other info?

Sorry for taking so long to respond.

Yes, it seems to always happen.

I just tested it again with the newest PTTBB. When starting it for the very first time (i.e., with an empty data directory), it bootstraps as expected. Subsequent times stall for 60 seconds. My system is a 64-bit Debian wheezy. The PTTBB is unconfigured except that I disabled flash proxies but I remember dcf mentioning that he experienced this with flash proxies as well. And yes, this also happens with cached descriptors, i.e., the file cached-descriptors.new being non-empty.

Replying to nickm:

How does one reproduce it? I just started a new tor with an empty datadir and a randomly chosen bridge, and it worked fine.

Please see my previous comment. First, I only have problems in combination with obfsproxy. In contrast to what karsten is experiencing, vanilla bridges work fine for me. Second, emtpy datadirs also work fine. The problem only occurs with non-empty datadirs.

Huh, I wonder why nobody can reproduce the problem I'm seeing. Describing it in more detail. Here's my torrc (note the UseBridges 1 part which phw doesn't seem to use, and maybe rightly so):

DataDirectory /home/ubuntu/src/client
SocksPort 9000
UseBridges 1
Bridge [scrubbed]:443
Log debug file /home/ubuntu/src/client/debug.log
Log info file /home/ubuntu/src/client/info.log
Log notice file /home/ubuntu/src/client/notice.log
LogTimeGranularity 1 msec
SafeLogging 0

[scrubbed] is a bridge I just got from bridges.tpo. I start tor (master, 7121e7b) using:

/home/ubuntu/src/tor-master/src/or/tor -f /home/ubuntu/src/client/torrc

Bootstraps to 100% just fine. Then I rm cached-* (but not state) and re-run:

/home/ubuntu/src/tor-master/src/or/tor -f /home/ubuntu/src/client/torrc

Takes a minute to download the microdesc consensus.

phw, can you try again using a setup closer to mine?

Replying to karsten:

Huh, I wonder why nobody can reproduce the problem I'm seeing. Describing it in more detail.

Alright, I can now reproduce; even with v0.2.4.16-rc (git-dcf6b6d7dda9ffbd). I missed that you only deleted cached-* files. Here's the log I get with your config. One can see the 60-second-wait.

Aug 22 11:40:46.442 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Aug 22 11:40:46.486 [notice] We were built to run on a 64-bit CPU, with OpenSSL 1.0.1 or later, but with a version of OpenSSL that apparently lacks accelerated support for the NIST P-224 and P-256 groups. Building openssl with such support (using the enable-ec_nistp_64_gcc_128 option when configuring it) would make ECDH much faster.
Aug 22 11:40:47.506 [notice] Bootstrapped 5%: Connecting to directory server.
Aug 22 11:40:47.639 [notice] Bootstrapped 10%: Finishing handshake with directory server.
Aug 22 11:40:48.048 [notice] Learned fingerprint [scrubbed].
Aug 22 11:40:48.048 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.
Aug 22 11:40:48.353 [notice] Bootstrapped 20%: Asking for networkstatus consensus.
Aug 22 11:40:48.487 [notice] Bootstrapped 50%: Loading relay descriptors.
Aug 22 11:40:48.755 [notice] new bridge descriptor [scrubbed].
Aug 22 11:40:48.755 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Aug 22 11:41:50.359 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Aug 22 11:41:50.891 [notice] I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 0/4314, and can only build 0% of likely paths. (We have 0% of guards bw, 0% of midpoint bw, and 0% of exit bw.)
Aug 22 11:41:52.506 [notice] We now have enough directory information to build circuits.
Aug 22 11:41:52.506 [notice] Bootstrapped 80%: Connecting to the Tor network.
Aug 22 11:41:52.511 [notice] Bootstrapped 90%: Establishing a Tor circuit.
Aug 22 11:41:53.921 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Aug 22 11:41:53.921 [notice] Bootstrapped 100%: Done.

Interesting. I can now reproduce that, given the "remove cached-* but not state" instruction.

If I remove the cached-* files and also remove the EntryGuard* entries from the state file, that is sufficient to make bootstrapping fast.

Changing networkstatus_dl_check_interval() in main.c to a lower value seems to resolve this for me. (Please confirm?) Before we do that in production, though, we need to check whether there are any negative effects (e.g., runaway download attempts on failure) from doing so.

Trac:
Status: needs_information to new

Replying to nickm:

Changing networkstatus_dl_check_interval() in main.c to a lower value seems to resolve this for me. (Please confirm?) Before we do that in production, though, we need to check whether there are any negative effects (e.g., runaway download attempts on failure) from doing so.

It seems to work for me. I lowered the value to 20 and could reduce the waiting time in both cases (karsten's vanilla bridges without cached-* files and my obfsproxy scenario) to 20 seconds.

So, reducing networkstatus_dl_check_interval() in the mainline code still makes me nervous. I see two possible ways forward here:

We could try very hard to make sure that there isn't a runaway-downloads bug that we'd be exacerbating by doing the check more frequently.
We could figure out what event (in this case) we should be triggering a download check in response to, and trigger a download check in response to that.

Trac:
Keywords: N/A deleted, flashproxy added

Trac:
Cc: asn, karsten to asn, karsten, mcs, brade

Trac:
Keywords: N/A deleted, tbb-needs added

Replying to nickm:

So, reducing networkstatus_dl_check_interval() in the mainline code still makes me nervous. I see two possible ways forward here: ... 2. We could figure out what event (in this case) we should be triggering a download check in response to, and trigger a download check in response to that.

This is the safer approach.

I tried provoking the bug again, but adding a log message every time we are about to call UPDATE_NETWORKSTATUS_DOWNLOADS from main.c. I saw:

Mar 07 15:21:49.000 [notice] Bootstrapped 5%: Connecting to directory server.
Mar 07 15:21:49.000 [notice] UPDATE_NETWORKSTATUS_DOWNLOADS
Mar 07 15:21:49.000 [notice] Bootstrapped 10%: Finishing handshake with directory server.
Mar 07 15:21:49.000 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.
Mar 07 15:21:49.000 [notice] Bootstrapped 20%: Asking for networkstatus consensus.
Mar 07 15:21:50.000 [notice] new bridge descriptor 'Unnamed' (fresh): $38FC956E026A600F4F66EA559D9CAAC4365E2C93~Unnamed at 192.36.31.159
Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Mar 07 15:21:50.000 [notice] new bridge descriptor 'braeu5' (fresh): $7683C2EA0896D5C3859CD594D691EE09831C1E8A~braeu5 at 90.225.64.7
Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Mar 07 15:21:50.000 [notice] new bridge descriptor 'Unnamed' (fresh): $476136C72FF5C937E97E95A7CFA5A44424237B5F~Unnamed at 188.165.251.124
Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.

 (long delay here)

Mar 07 15:22:50.000 [notice] UPDATE_NETWORKSTATUS_DOWNLOADS
Mar 07 15:22:50.000 [notice] Bootstrapped 25%: Loading networkstatus consensus.
Mar 07 15:22:53.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
Mar 07 15:22:53.000 [notice] Bootstrapped 40%: Loading authority key certs.

So there's a clue that we should be calling update_networkstatus_downloads() in response to getting new bridge descriptors if we don't have a fresh networkstatus.

=== Broken thing 1:

In learned_bridge_descriptor(), when I reproduce the bug, the body of this if-statement is never executed:

      if (first) {
        routerlist_retry_directory_downloads(now);
      }

Interesting! Apparently, by the time we hit learned_bridge_descriptor(), any_bridge_descriptors_known() is always true. At least, for me.

The function routerlist_retry_directory_downloads() seems pretty heavy-weight for the general case we're concerned about here too. In my testing, update_networkstatus_downloads() was sufficient. Writing a patch...

Branch "bug9229_024" in my public repository has an alleged fix suitable for merge into 0.2.4 IMO, assuming it tests out okay. (It works for me.)

A more thorough fix might figure out why any_bridge_descriptor_known() is returning true in this case, and figure out a better way to fix that. I have a fix like that in "bug9229_025".

Please review and test?

Trac:
Keywords: tbb-needs deleted, tbb-needs 024-backport added
Status: new to needs_review

Trac:
Keywords: tbb-needs 024-backport deleted, tbb-needs 024-backport 025-triaged added

Replying to nickm:

Branch "bug9229_024" in my public repository has an alleged fix suitable for merge into 0.2.4 IMO, assuming it tests out okay. (It works for me.)

Are we sure we don't want an update_all_descriptor_downloads() call too?

I'm thinking of a case where we have an adequate cached consensus, and fetching the bridge descriptor should trigger us to fetch our microdescriptors now.

(That said, I think your fix is an improvement over the current 0.2.4 situation, so it depends how far we want to go.)

FWIW, I have tested bug9229_025 in two boxes so far, and it worked fine.

13:23 < asn> nickm: i still don't understand why any_bridge_descriptors_known() 
             doesn't do the trick in bug9229_025 
13:24 < asn> but I think that the patch can't make the situation worse
13:24 < nickm> hm. num_bridges_usable in that branch is totally misnamed
13:26 < nickm> also n_options_out is misimplemented and useless

Trac:
Status: needs_review to needs_revision

So, I fixed two major bugs in the patch with 6ab10a546663b509d189ce16645b11af0e83d41c.

The reason not to use any_bridge_descriptor_known() is that, by the time we were checking it, we had one good bridge descriptor, and so it was returning false, and so we couldn't actually notice that we had just become live.

Previous testing should probably be considered useless; can anybody test it now?

Trac:
Status: needs_revision to needs_review

Drop owners from needs_review tickets in tor 0.2.5

Trac:
Keywords: tbb-needs 024-backport 025-triaged deleted, 024-backport, tbb-needs, andrea-review-0254, 025-triaged added

The bug9229_024 branch looks fine to me; now looking at bug9229_025.

Code review for nickm/bug9229_025 branch:

7aa21920481fc4f0bdb159aeedf11582aa95dd01:
- Seems less than intuitive to find a function called choose_random_entry_impl() which actually either does that or counts the number of entries. Is this just because that was the only place we happened to already have code to construct the list of bridges?
- The stuff you fixed in 6ab10a546663b509d189ce16645b11af0e83d41c
6ab10a546663b509d189ce16645b11af0e83d41c
- Looks fine, good idea to make things work :)

nickm/bug9229_024 was backported to TBB 3.6beta1 (https://gitweb.torproject.org/builders/tor-browser-bundle.git/blob/HEAD:/gitian/patches/bug9229.patch). Right now, we are not intending on backporting the 0.2.5.x version, unless someone suggests otherwise.

Trac:
Keywords: N/A deleted, tbb-tor-backported-3.6b1 added

Replying to andrea:

Code review for nickm/bug9229_025 branch:

7aa21920481fc4f0bdb159aeedf11582aa95dd01:

Seems less than intuitive to find a function called choose_random_entry_impl() which actually either does that or counts the number of entries. Is this just because that was the only place we happened to already have code to construct the list of bridges?

Right; also, because having the code to construct the list in two places was would be begging for them to fall out of sync.

Merging the 0.2.5 one to 0.2.5 and the 0.2.4 one to 0.2.4.

Trac:
Status: needs_review to closed
Resolution: N/A to fixed

closed

mentioned in issue #9557 (moved)

mentioned in issue #9922 (moved)

mentioned in issue #10082 (closed)

mentioned in issue #10418 (moved)

mentioned in issue #11156 (moved)

mentioned in issue #11183 (moved)

mentioned in issue #11612 (moved)

mentioned in issue #11879 (moved)

mentioned in issue #11965 (moved)

mentioned in issue #15713 (moved)

While bootstrapping, Tor clients stall for 60s when obfsproxy bridges are used.

Child items 0

Activity