When Tor bootstraps using obfsproxy, it stalls for 60 seconds. The last three log messages before that happening are:
[notice] Bootstrapped 50%: Loading relay descriptors.[notice] new bridge descriptor 'bzoum' (fresh): $2ADFE7AA8D272C520D1FBFBF4E413F3A1B26313D~bzoum at [redacted][notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
After 60 seconds, it continues to bootstrap to 100%. This was observed for all pluggable transports: flash proxies, obfs{2,3} and ScrambleSuit. Vanilla bridges do not seem to be affected by this. Also, the very first bootstrap (i.e., with an empty data directory) also seems to be unaffected by this.
I attached two files containing debug logs (after the second bootstrap) of obfs3 and vanilla bridges. I removed the timestamps to make it easy to compare both of them.
I tested this with Tor v0.2.4.15-rc and obfsproxy da31ffc638a2727df1679bc888c2a8c1c49809a9.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items 0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items 0
Link issues together to show that they're related.
Learn more.
Looks like this problem also happens on normal bridges. See #9557 (moved) which I just closed as dup, because this is very likely the same problem.
FWIW, I seem to have the opposite problem. My tor (with obfsproxy) bootstraps fine only the first time. Subsequent bootstraps then lead to the 60-second-problem.
Also, I can't reproduce #9557 (moved). I just tested this with tor v0.2.4.16-rc (git-dcf6b6d7dda9ffbd) using:
tor --datadirectory /tmp/tmptor --bridge x.x.x.x:443 --controlport auto --socksport auto
Philipp, I cannot reproduce this in my workstation
Does it always happen to you? Can you remind me if it also happens when you have cached descriptors? Any other info?
Sorry for taking so long to respond.
Yes, it seems to always happen.
I just tested it again with the newest PTTBB. When starting it for the very first time (i.e., with an empty data directory), it bootstraps as expected. Subsequent times stall for 60 seconds. My system is a 64-bit Debian wheezy. The PTTBB is unconfigured except that I disabled flash proxies but I remember dcf mentioning that he experienced this with flash proxies as well. And yes, this also happens with cached descriptors, i.e., the file cached-descriptors.new being non-empty.
How does one reproduce it? I just started a new tor with an empty datadir and a randomly chosen bridge, and it worked fine.
Please see my previous comment. First, I only have problems in combination with obfsproxy. In contrast to what karsten is experiencing, vanilla bridges work fine for me. Second, emtpy datadirs also work fine. The problem only occurs with non-empty datadirs.
Huh, I wonder why nobody can reproduce the problem I'm seeing. Describing it in more detail. Here's my torrc (note the UseBridges 1 part which phw doesn't seem to use, and maybe rightly so):
Huh, I wonder why nobody can reproduce the problem I'm seeing. Describing it in more detail.
Alright, I can now reproduce; even with v0.2.4.16-rc (git-dcf6b6d7dda9ffbd). I missed that you only deleted cached-* files. Here's the log I get with your config. One can see the 60-second-wait.
Aug 22 11:40:46.442 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.Aug 22 11:40:46.486 [notice] We were built to run on a 64-bit CPU, with OpenSSL 1.0.1 or later, but with a version of OpenSSL that apparently lacks accelerated support for the NIST P-224 and P-256 groups. Building openssl with such support (using the enable-ec_nistp_64_gcc_128 option when configuring it) would make ECDH much faster.Aug 22 11:40:47.506 [notice] Bootstrapped 5%: Connecting to directory server.Aug 22 11:40:47.639 [notice] Bootstrapped 10%: Finishing handshake with directory server.Aug 22 11:40:48.048 [notice] Learned fingerprint [scrubbed].Aug 22 11:40:48.048 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.Aug 22 11:40:48.353 [notice] Bootstrapped 20%: Asking for networkstatus consensus.Aug 22 11:40:48.487 [notice] Bootstrapped 50%: Loading relay descriptors.Aug 22 11:40:48.755 [notice] new bridge descriptor [scrubbed].Aug 22 11:40:48.755 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.Aug 22 11:41:50.359 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.Aug 22 11:41:50.891 [notice] I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 0/4314, and can only build 0% of likely paths. (We have 0% of guards bw, 0% of midpoint bw, and 0% of exit bw.)Aug 22 11:41:52.506 [notice] We now have enough directory information to build circuits.Aug 22 11:41:52.506 [notice] Bootstrapped 80%: Connecting to the Tor network.Aug 22 11:41:52.511 [notice] Bootstrapped 90%: Establishing a Tor circuit.Aug 22 11:41:53.921 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.Aug 22 11:41:53.921 [notice] Bootstrapped 100%: Done.
Changing networkstatus_dl_check_interval() in main.c to a lower value seems to resolve this for me. (Please confirm?) Before we do that in production, though, we need to check whether there are any negative effects (e.g., runaway download attempts on failure) from doing so.
Changing networkstatus_dl_check_interval() in main.c to a lower value seems to resolve this for me. (Please confirm?) Before we do that in production, though, we need to check whether there are any negative effects (e.g., runaway download attempts on failure) from doing so.
It seems to work for me. I lowered the value to 20 and could reduce the waiting time in both cases (karsten's vanilla bridges without cached-* files and my obfsproxy scenario) to 20 seconds.
So, reducing networkstatus_dl_check_interval() in the mainline code still makes me nervous. I see two possible ways forward here:
We could try very hard to make sure that there isn't a runaway-downloads bug that we'd be exacerbating by doing the check more frequently.
We could figure out what event (in this case) we should be triggering a download check in response to, and trigger a download check in response to that.
So, reducing networkstatus_dl_check_interval() in the mainline code still makes me nervous. I see two possible ways forward here:
...
2. We could figure out what event (in this case) we should be triggering a download check in response to, and trigger a download check in response to that.
This is the safer approach.
I tried provoking the bug again, but adding a log message every time we are about to call UPDATE_NETWORKSTATUS_DOWNLOADS from main.c. I saw:
Mar 07 15:21:49.000 [notice] Bootstrapped 5%: Connecting to directory server.Mar 07 15:21:49.000 [notice] UPDATE_NETWORKSTATUS_DOWNLOADSMar 07 15:21:49.000 [notice] Bootstrapped 10%: Finishing handshake with directory server.Mar 07 15:21:49.000 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.Mar 07 15:21:49.000 [notice] Bootstrapped 20%: Asking for networkstatus consensus.Mar 07 15:21:50.000 [notice] new bridge descriptor 'Unnamed' (fresh): $38FC956E026A600F4F66EA559D9CAAC4365E2C93~Unnamed at 192.36.31.159Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.Mar 07 15:21:50.000 [notice] new bridge descriptor 'braeu5' (fresh): $7683C2EA0896D5C3859CD594D691EE09831C1E8A~braeu5 at 90.225.64.7Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.Mar 07 15:21:50.000 [notice] new bridge descriptor 'Unnamed' (fresh): $476136C72FF5C937E97E95A7CFA5A44424237B5F~Unnamed at 188.165.251.124Mar 07 15:21:50.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus. (long delay here)Mar 07 15:22:50.000 [notice] UPDATE_NETWORKSTATUS_DOWNLOADSMar 07 15:22:50.000 [notice] Bootstrapped 25%: Loading networkstatus consensus.Mar 07 15:22:53.000 [notice] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.Mar 07 15:22:53.000 [notice] Bootstrapped 40%: Loading authority key certs.
So there's a clue that we should be calling update_networkstatus_downloads() in response to getting new bridge descriptors if we don't have a fresh networkstatus.
In learned_bridge_descriptor(), when I reproduce the bug, the body of this if-statement is never executed:
if (first) { routerlist_retry_directory_downloads(now); }
Interesting! Apparently, by the time we hit learned_bridge_descriptor(), any_bridge_descriptors_known() is always true. At least, for me.
The function routerlist_retry_directory_downloads() seems pretty heavy-weight for the general case we're concerned about here too. In my testing, update_networkstatus_downloads() was sufficient. Writing a patch...
Branch "bug9229_024" in my public repository has an alleged fix suitable for merge into 0.2.4 IMO, assuming it tests out okay. (It works for me.)
A more thorough fix might figure out why any_bridge_descriptor_known() is returning true in this case, and figure out a better way to fix that. I have a fix like that in "bug9229_025".
Please review and test?
Trac: Keywords: tbb-needs deleted, tbb-needs 024-backport added Status: new to needs_review
Branch "bug9229_024" in my public repository has an alleged fix suitable for merge into 0.2.4 IMO, assuming it tests out okay. (It works for me.)
Are we sure we don't want an update_all_descriptor_downloads() call too?
I'm thinking of a case where we have an adequate cached consensus, and fetching the bridge descriptor should trigger us to fetch our microdescriptors now.
(That said, I think your fix is an improvement over the current 0.2.4 situation, so it depends how far we want to go.)
13:23 < asn> nickm: i still don't understand why any_bridge_descriptors_known() doesn't do the trick in bug9229_025 13:24 < asn> but I think that the patch can't make the situation worse13:24 < nickm> hm. num_bridges_usable in that branch is totally misnamed13:26 < nickm> also n_options_out is misimplemented and useless
So, I fixed two major bugs in the patch with 6ab10a546663b509d189ce16645b11af0e83d41c.
The reason not to use any_bridge_descriptor_known() is that, by the time we were checking it, we had one good bridge descriptor, and so it was returning false, and so we couldn't actually notice that we had just become live.
Previous testing should probably be considered useless; can anybody test it now?
Seems less than intuitive to find a function called choose_random_entry_impl() which actually either does that or counts the number of entries. Is this just because that was the only place we happened to already have code to construct the list of bridges?
The stuff you fixed in 6ab10a546663b509d189ce16645b11af0e83d41c
Seems less than intuitive to find a function called choose_random_entry_impl() which actually either does that or counts the number of entries. Is this just because that was the only place we happened to already have code to construct the list of bridges?
Right; also, because having the code to construct the list in two places was would be begging for them to fall out of sync.
Merging the 0.2.5 one to 0.2.5 and the 0.2.4 one to 0.2.4.