If you configure your Tor client to use a bridge (and maybe you need to also use a pluggable transport to trigger this bug -- dcf did but I don't currently see why), the first time you run it you'll fetch the bridge descriptor and cache it to disk. The act of fetching it and finding it novel will trigger the code in learned_bridge_descriptor():
{{{
if (first) {
  routerlist_retry_directory_downloads(now);
}
}}}
which will spur you to move on to fetch a consensus from your bridge.
But the second time you run it, if it's soon after, you won't find the bridge descriptor novel:
{{{
May 14 23:32:14.000 [info] connection_dir_client_reached_eof(): Received server info (size 1165) from server '169.x.x.x:y'
May 14 23:32:14.000 [debug] router_parse_list_from_string(): Read router '$AF9F[...]~Unnamed at 169.x.x.x', purpose 'bridge'
May 14 23:32:14.000 [info] router_load_routers_from_string(): 1 elements to add
May 14 23:32:14.000 [info] router_add_to_routerlist(): Dropping descriptor that we already have for router $AF9F[...]~Unnamed at 169.x.x.x
}}}
which means you won't call anything to move on to fetching the consensus. 60 seconds later the timer in main.c will trigger and you'll ask for it.
and then part 2 is to make directory_info_has_arrived() see if it wants to fetch consensus stuff as well as descriptor stuff:
{{{
diff --git a/src/or/main.c b/src/or/main.c
index 5532026..20b64ec 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -980,10 +980,12 @@ directory_info_has_arrived(time_t now, int from_cache)
     tor_log(quiet ? LOG_INFO : LOG_NOTICE, LD_DIR,
         "I learned some more directory information, but not enough to "
         "build a circuit: %s", get_dir_info_status_string());
+    update_networkstatus_downloads(now);
     update_all_descriptor_downloads(now);
     return;
   } else {
     if (directory_fetches_from_authorities(options)) {
+      update_networkstatus_downloads(now);
       update_all_descriptor_downloads(now);
     }
}}}
Normal users will already have their consensus stuff by this point, so it will conclude "no, no need to fetch any of that". But for bridge users, inviting them to fetch descriptors or microdescriptors is premature, since they don't have any consensus yet.
And then the optional part three of the patch, maybe to be applied later, is to remove the
{{{
if (first) {
  routerlist_retry_directory_downloads(now);
}
}}}
clause from learned_bridge_descriptor() since now I think it's redundant. As a bonus, that means we also blow away the num_bridges_usable() function since nothing else uses it.
So, my 'third step' is suspicious -- on the second run, shouldn't that routerlist_retry_directory_downloads() be triggering some consensus fetches as soon as the bridge descriptor is read from disk?
And the answer is:
{{{
May 14 23:32:11.000 [notice] new bridge descriptor 'Unnamed' (cached): $AF9F66B7B04F8FF6F32D455F05135250A16543C9~Unnamed at 169.229.59.74
May 14 23:32:11.000 [info] should_delay_dir_fetches(): Delaying dir fetches (pt proxies still configuring)
}}}
So it does trigger the fetches, but they get cancelled because the pt proxies are still in progress at that point.
Which leads me to ask: is there anything that gets triggered when pt_proxies_configuration_pending() transitions from true to false, or is it likely we have a race condition here if it takes a while to configure our PTs?
> So, my 'third step' is suspicious -- on the second run, shouldn't that routerlist_retry_directory_downloads() be triggering some consensus fetches as soon as the bridge descriptor is read from disk?
>
> And the answer is:
>
> {{{
> May 14 23:32:11.000 [notice] new bridge descriptor 'Unnamed' (cached): $AF9F66B7B04F8FF6F32D455F05135250A16543C9~Unnamed at 169.229.59.74
> May 14 23:32:11.000 [info] should_delay_dir_fetches(): Delaying dir fetches (pt proxies still configuring)
> }}}
>
> So it does trigger the fetches, but they get cancelled because the pt proxies are still in progress at that point.
Correct. That block was added in #11156 (moved), because Tor was trying to connect to destinations while the PTs were still doing the managed-proxy configuration protocol.
> Which leads me to ask: is there anything that gets triggered when pt_proxies_configuration_pending() transitions from true to false, or is it likely we have a race condition here if it takes a while to configure our PTs?
Hm, the managed-proxy configuration protocol takes 2-3 ticks of run_scheduled_events() to complete. Since in this case, bridge descriptors were read from disk (maybe using router_reload_router_list from do_main_loop()), it's likely that they were read and parsed before the PTs stopped configuring.
Not sure what the best fix is here.
A pretty stupid fix that might work would be to move the router_reload_router_list() call to after the PTs have been configured.
Another fix might involve queuing up those descriptor/consensus fetches, and firing them up when PTs have been configured. This sounds a bit more involved, and requires good understanding of the networking logic of Tor (when documents get fetched, etc.) so that we don't mess things up.
Another fix might involve performing the PT configuration protocol in a synchronous/blocking way immediately upon config read, so that we finish with it as soon as possible and stop worrying about it later. The correctness of this might also be hard to evaluate. I'm not sure how much time the PT configuration protocol would take if it was done synchronously; probably not too much. I'm only suggesting this because the same functionality caused another 60s bug in the past (ticket:11156:comment:19).
Looks like this is needs_revision, since if I'm reading the above comments right, the fix doesn't fix it?
My 'part 1' and 'part 2' together fix dcf's bug (this ticket).
I think my 'part 3' is then safe to apply on top of those two, as a code simplification, if we want.
I think somebody should open another ticket for the "when the PTs are done initializing, nothing gets called to launch whatever directory requests are wanted next" bug (the one I speculate about, and asn describes more, above).
> Which leads me to ask: is there anything that gets triggered when pt_proxies_configuration_pending() transitions from true to false, or is it likely we have a race condition here if it takes a while to configure our PTs?
asn, what would you think about the following patch?
{{{
diff --git a/src/or/main.c b/src/or/main.c
index a2c5743..88a0793 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -1231,7 +1231,8 @@ run_scheduled_events(time_t now)
     router_upload_dir_desc_to_dirservers(0);
   }

-  if (!options->DisableNetwork && time_to_try_getting_descriptors < now) {
+  if (!should_delay_dir_fetches(options, NULL) &&
+      time_to_try_getting_descriptors < now) {
     update_all_descriptor_downloads(now);
     update_extrainfo_downloads(now);
     if (router_have_minimum_dir_info())
@@ -1461,7 +1462,8 @@ run_scheduled_events(time_t now)
    * documents? */
 #define networkstatus_dl_check_interval(o) ((o)->TestingTorNetwork ? 1 : 60)

-  if (time_to_download_networkstatus < now && !options->DisableNetwork) {
+  if (!should_delay_dir_fetches(options, NULL) &&
+      time_to_download_networkstatus < now) {
     time_to_download_networkstatus =
       now + networkstatus_dl_check_interval(options);
     update_networkstatus_downloads(now);
}}}
Basically, we skip the "should we launch those two classes of dir fetches" steps not just when DisableNetwork is set, but also when you're using bridges and you don't have any useful descriptors yet, and when you're using bridges and you have pt_proxies_configuration_pending(). And then when should_delay_dir_fetches goes false, time_to_foo is before now, so they fire on the next run_scheduled_events().
I note that should_delay_dir_fetches() doesn't look at DisableNetwork. Does the above patch (the one in comment 7) generate warnings with DisableNetwork?
> I note that should_delay_dir_fetches() doesn't look at DisableNetwork. Does the above patch (the one in comment 7) generate warnings with DisableNetwork?
I added DisableNetwork 1 to the torrc from comment:8, and got:
{{{
May 20 10:51:45.703 [notice] Tor v0.2.5.4-alpha-dev (git-767b18ea8eebc2c0) running on Linux with Libevent 2.0.21-stable, OpenSSL 1.0.1h-dev and Zlib 1.2.8.
May 20 10:51:45.704 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
May 20 10:51:45.704 [notice] This version is not a stable Tor release. Expect more bugs than usual.
May 20 10:51:45.704 [notice] Read configuration file "/home/david/tmp/torrc".
May 20 10:51:45.710 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
May 20 10:51:45.000 [notice] Bootstrapped 0%: Starting
May 20 10:51:45.000 [notice] new bridge descriptor 'Unnamed' (cached): $AF9F66B7B04F8FF6F32D455F05135250A16543C9~Unnamed at 169.229.59.74
}}}
info-level logging prints a lot more, but I suppose what you're interested in is:
{{{
May 20 10:54:39.000 [info] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
May 20 10:54:40.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:40.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:40.000 [info] or_state_save(): Saved state to "datadir/state"
May 20 10:54:41.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:41.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:42.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:42.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
}}}
I'm merging this, but I wasn't able to easily figure out where the bug first occurred. Is it as old as bridges? Investigation would be appreciated. Otherwise I won't be able to write a "bugfix on XXX" in the changelog.
Fix merged as de60284e7631. I hope I got the changes file right. Putting in needs_revision for somebody to look for the version where this bug first occurred. Will close if nobody does before 0.2.5.5.
My bisect found 5f13ae4b032d033ec51bb79fc69717e02faaefd1 as the first commit where the stall at 20% happens. I got the impression from arma that it was related to other changes, though, and only revealed by that commit, not caused by it.
> Which leads me to ask: is there anything that gets triggered when pt_proxies_configuration_pending() transitions from true to false, or is it likely we have a race condition here if it takes a while to configure our PTs?
>
> asn, what would you think about the following patch?
>
> Basically, we skip the "should we launch those two classes of dir fetches" steps not just when DisableNetwork is set, but also when you're using bridges and you don't have any useful descriptors yet, and when you're using bridges and you have pt_proxies_configuration_pending(). And then when should_delay_dir_fetches goes false, time_to_foo is before now, so they fire on the next run_scheduled_events().
Sounds like it might work. I can test this approach in 2-3 days.
I was never able to reproduce this issue without pluggable transports. (I thought I was, but I now think I just had a slow bridge).
So I think the patch in bug11965 is a fine patch, but I'm only certain it applies to the case where you're using bridges and also pluggable transports. I think that means bugfix on 0.2.3.6-alpha?
I'm trying to reproduce this issue, but I can't trigger it after the patch to #9229 (moved).
I'm doing the following. I have a simple torrc that connects to an obfs3 bridge. I bootstrap to that bridge once. Then kill Tor. Then try to bootstrap a second time immediately afterwards. I bootstrap normally with the #9229 (moved) patch. Without it, it takes 60s.
I'm using tor git master btw (fef65fa64341fb70df0e7b34d91d3b08a74e7aad) no extra patches. cached-descriptors.new was populated with the bridge descriptor as normal.
> I'm trying to reproduce this issue, but I can't trigger it after the patch to #9229 (moved).
>
> I'm doing the following. I have a simple torrc that connects to an obfs3 bridge. I bootstrap to that bridge once. Then kill Tor. Then try to bootstrap a second time immediately afterwards. I bootstrap normally with the #9229 (moved) patch. Without it, it takes 60s.
>
> I'm using tor git master btw (fef65fa64341fb70df0e7b34d91d3b08a74e7aad) no extra patches. cached-descriptors.new was populated with the bridge descriptor as normal.
Maybe try a different DataDirectory? It also might matter that your Bridge line has a fingerprint.
asn: have your bridge fingerprint (edit: no, not fingerprint, descriptor) in your cached-descriptors file, and then delete your cached-micro* files, and then try.
> I'm using tor git master btw (fef65fa64341fb70df0e7b34d91d3b08a74e7aad) no extra patches. cached-descriptors.new was populated with the bridge descriptor as normal.

> ...It also might matter that your Bridge line has a fingerprint.
That's exactly the problem. I was able to reproduce asn's inability to reproduce the bug by using a bridge line without a fingerprint. Once I added the fingerprint, the problem resurfaced. Reproduced both on fef65fa64341fb70df0e7b34d91d3b08a74e7aad and the now-current master (cfd0ee514c279bc6c7b7c299e001693a5aeb1f5f).
Like arma, I was only able to reproduce this with pluggable transports. The nickm/bug11965 branch fixes it for me.
Yes! I reproduced this by adding a fingerprint to my torrc. Without a fingerprint, it was not loading the bridge descriptor from the cache because of ROUTER_WAS_NOT_WANTED.
arma/nick's branch works fine and solves the 60s problem for me!