60-second pause bootstrapping with a bridge if you already have its descriptor

changed milestone to %Tor: 0.2.5.x-final

added component::core tor/tor milestone::Tor: 0.2.5.x-final priority::medium resolution::fixed status::closed tor-bridge type::defect labels

Part 1 of my suggested fix is to call directory_info_has_arrived() from directory.c even when we didn't find anything novel in what we fetched:

diff --git a/src/or/directory.c b/src/or/directory.c
index b94aac4..2d04c67 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -1901,10 +1901,10 @@ connection_dir_client_reached_eof(dir_connection_t *conn
       } else {
         //router_load_routers_from_string(body, NULL, SAVED_NOWHERE, which,
         //                       descriptor_digests, conn->router_purpose);
-        if (load_downloaded_routers(body, which, descriptor_digests,
+        load_downloaded_routers(body, which, descriptor_digests,
                                 conn->router_purpose,
-                                conn->base_.address))
-          directory_info_has_arrived(now, 0);
+                                conn->base_.address);
+        directory_info_has_arrived(now, 0);
       }
     }
     if (which) { /* mark remaining ones as failed */

and then part 2 is to make directory_info_has_arrived() see if it wants to fetch consensus stuff as well as descriptor stuff:

diff --git a/src/or/main.c b/src/or/main.c
index 5532026..20b64ec 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -980,10 +980,12 @@ directory_info_has_arrived(time_t now, int from_cache)
     tor_log(quiet ? LOG_INFO : LOG_NOTICE, LD_DIR,
         "I learned some more directory information, but not enough to "
         "build a circuit: %s", get_dir_info_status_string());
+    update_networkstatus_downloads(now);
     update_all_descriptor_downloads(now);
     return;
   } else {
     if (directory_fetches_from_authorities(options)) {
+      update_networkstatus_downloads(now);
       update_all_descriptor_downloads(now);
     }

For normal users, they will already have their consensus stuff by this point, so it will conclude "no, no need to fetch any of that". But for bridge users, inviting them to fetch descriptors or microdescriptors is premature, since they don't have any consensus yet.

And then the optional part three of the patch, maybe to be applied later, is to remove the

      if (first) {
        routerlist_retry_directory_downloads(now);
      }

clause from learned_bridge_descriptor() since now I think it's redundant. As a bonus, that means we also blow away the num_bridges_usable() function since nothing else uses it.

Trac:
Status: new to needs_review

So, my 'third step' is suspicious -- on the second run, shouldn't that routerlist_retry_directory_downloads() be triggering some consensus fetches as soon as the bridge descriptor is read from disk?

And the answer is:

May 14 23:32:11.000 [notice] new bridge descriptor 'Unnamed' (cached): $AF9F66B7B04F8FF6F32D455F05135250A16543C9~Unnamed at 169.229.59.74
May 14 23:32:11.000 [info] should_delay_dir_fetches(): Delaying dir fetches (pt proxies still configuring)

So it does trigger the fetches, but they get cancelled because the pt proxies are still in progress at that point.

Which leads me to ask: is there anything that gets triggered when pt_proxies_configuration_pending() transitions from true to false, or is it likely we have a race condition here if it takes a while to configure our PTs?

Replying to arma:

So, my 'third step' is suspicious -- on the second run, shouldn't that routerlist_retry_directory_downloads() be triggering some consensus fetches as soon as the bridge descriptor is read from disk?

And the answer is: {{{ May 14 23:32:11.000 [notice] new bridge descriptor 'Unnamed' (cached): $AF9F66B7B04F8FF6F32D455F05135250A16543C9~Unnamed at 169.229.59.74 May 14 23:32:11.000 [info] should_delay_dir_fetches(): Delaying dir fetches (pt proxies still configuring) }}}

So it does trigger the fetches, but they get cancelled because the pt proxies are still in progress at that point.

Correct. That block was added in #11156 (moved), because Tor was trying to connect to destinations while the PTs were still doing the managed-proxy configuration protocol

Which leads me to ask: is there anything that gets triggered when pt_proxies_configuration_pending() transitions from true to false, or is it likely we have a race condition here if it takes a while to configure our PTs?

Hm, the managed-proxy configuration protocol takes 2-3 ticks of run_scheduled_events() to complete. Since in this case, bridge descriptors were read from disk (maybe using router_reload_router_list from do_main_loop()), it's likely that they were read and parsed before the PTs stopped configuring.

Not sure what's the best fix here.

A pretty stpid one that might work would be to move router_reload_router_list() only after PTs have been configured.

Another fix might involve queuing up those descriptor/consensus fetches, and firing them up when PTs have been configured. This sounds a bit more involved, and requires good understanding of the networking logic of Tor (when documents get fetched, etc.) so that we don't mess things up.

Another fix might involve performing the PT configuration protocol in a synchronous/blocking way immediately upon config read, so that we finish with it the soonest possible and stop worrying about it later. This might also be hard to evalute its correctness. I'm not sure how much time the PT configuration protocol would take if it was done synchronously; probably not too much. I'm only suggesting this because the same functionality caused another 60s bug in the past (ticket:11156:comment:19).

More fix approaches are probably possible.

Looks like this is needs_revision, since if I'm reading the above comments right, the fix doesn't fix it?

Trac:
Status: needs_review to needs_revision

Replying to nickm:

Looks like this is needs_revision, since if I'm reading the above comments right, the fix doesn't fix it?

My 'part 1' and 'part 2' together fix dcf's bug (this ticket).

I think my 'part 3' is then safe to apply on top of those two, as a code simplification, if we want.

I think somebody should open another ticket for the "when the PTs are done initializing, nothing gets called to launch whatever directory requests are wanted next" bug (the one I speculate about, and asn describes more, above).

Trac:
Status: needs_revision to needs_review

Okay, will review. Should this get considered for 0.2.4 too? What version is this a bugfix on?

Replying to arma:

Which leads me to ask: is there anything that gets triggered when pt_proxies_configuration_pending() transitions from true to false, or is it likely we have a race condition here if it takes a while to configure our PTs?

asn, what would you think about the following patch?

diff --git a/src/or/main.c b/src/or/main.c
index a2c5743..88a0793 100644
--- a/src/or/main.c
+++ b/src/or/main.c
@@ -1231,7 +1231,8 @@ run_scheduled_events(time_t now)
       router_upload_dir_desc_to_dirservers(0);
   }
 
-  if (!options->DisableNetwork && time_to_try_getting_descriptors < now) {
+  if (!should_delay_dir_fetches(options, NULL) &&
+      time_to_try_getting_descriptors < now) {
     update_all_descriptor_downloads(now);
     update_extrainfo_downloads(now);
     if (router_have_minimum_dir_info())
@@ -1461,7 +1462,8 @@ run_scheduled_events(time_t now)
  * documents? */
 #define networkstatus_dl_check_interval(o) ((o)->TestingTorNetwork ? 1 : 60)
 
-  if (time_to_download_networkstatus < now && !options->DisableNetwork) {
+  if (!should_delay_dir_fetches(options, NULL) &&
+      time_to_download_networkstatus < now) {
     time_to_download_networkstatus =
       now + networkstatus_dl_check_interval(options);
     update_networkstatus_downloads(now);

Basically, we skip the "should we launch those two classes of dir fetches" steps not just when DisableNetwork is set, but also when you're using bridges and you don't have any useful descriptors yet, and when you're using bridges and you have pt_proxies_configuration_pending(). And then when should_delay_dir_fetches goes false, time_to_foo is before now, so they fire on the next run_scheduled_events().

dcf, can you try the above patch (and only this patch), and see if it resolves your original problem too?

Trac:
Cc: N/A to dcf

Replying to arma:

dcf, can you try the above patch (and only this patch), and see if it resolves your original problem too?

Without the patch, 767b18ea8eebc2c06a2d3d76d7c4e1ac4d70827d stalls at 20%.

With the patch, it does not stall.

The torrc I used was

DataDirectory datadir
UseBridges 1
Bridge obfs3 169.229.59.74:31493 AF9F66B7B04F8FF6F32D455F05135250A16543C9
ClientTransportPlugin obfs3 exec /usr/local/bin/obfsproxy managed

I note that should_delay_dir_fetches() doesn't look at DisableNetwork. Does the above patch (the one in comment 7) generate warnings with DisableNetwork?

Replying to nickm:

I note that should_delay_dir_fetches() doesn't look at DisableNetwork. Does the above patch (the one in comment 7) generate warnings with DisableNetwork?

I added DisableNetwork 1 to the torrc from comment:8, and got:

May 20 10:51:45.703 [notice] Tor v0.2.5.4-alpha-dev (git-767b18ea8eebc2c0) running on Linux with Libevent 2.0.21-stable, OpenSSL 1.0.1h-dev and Zlib 1.2.8.
May 20 10:51:45.704 [notice] Tor can't help you if you use it wrong! Learn how to be safe at https://www.torproject.org/download/download#warning
May 20 10:51:45.704 [notice] This version is not a stable Tor release. Expect more bugs than usual.
May 20 10:51:45.704 [notice] Read configuration file "/home/david/tmp/torrc".
May 20 10:51:45.710 [notice] DisableNetwork is set. Tor will not make or accept non-control network connections. Shutting down all existing connections.
May 20 10:51:45.000 [notice] Bootstrapped 0%: Starting
May 20 10:51:45.000 [notice] new bridge descriptor 'Unnamed' (cached): $AF9F66B7B04F8FF6F32D455F05135250A16543C9~Unnamed at 169.229.59.74

info-level logging prints a lot more, but I suppose what you're interested in is:

May 20 10:54:39.000 [info] I learned some more directory information, but not enough to build a circuit: We have no usable consensus.
May 20 10:54:40.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:40.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:40.000 [info] or_state_save(): Saved state to "datadir/state"
May 20 10:54:41.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:41.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:42.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)
May 20 10:54:42.000 [info] should_delay_dir_fetches(): Delaying dir fetches (DisableNetwork is set)

I note that should_delay_dir_fetches() doesn't look at DisableNetwork

Actually, it does, since d01cf18ecbb02bca7e4168b28f28d8b84f03da70 . I was looking at a branch based on an old version of master.

I'm merging this, but I wasn't able to easily figure out where the bug first occurred. Is it as old as bridges? Investigation would be appreciated. Otherwise I won't be able to write a "bugfix on XXX" in the changelog.

Fix merged as de60284e7631. I hope I got the changes file right. Putting in needs_revision for somebody to look for the version where this bug first occurred. Will close if nobody does before 0.2.5.5.

Trac:
Status: needs_review to needs_revision

My bisect found 5f13ae4b032d033ec51bb79fc69717e02faaefd1 as the first commit where the stall at 20% happens. I got the impression from arma that it was related to other changes, though, and only revealed by that commit, not caused by it.

Replying to arma:

Replying to arma:

Which leads me to ask: is there anything that gets triggered when pt_proxies_configuration_pending() transitions from true to false, or is it likely we have a race condition here if it takes a while to configure our PTs?

asn, what would you think about the following patch?

Basically, we skip the "should we launch those two classes of dir fetches" steps not just when DisableNetwork is set, but also when you're using bridges and you don't have any useful descriptors yet, and when you're using bridges and you have pt_proxies_configuration_pending(). And then when should_delay_dir_fetches goes false, time_to_foo is before now, so they fire on the next run_scheduled_events().

Sounds like it might work. I can test this approach in 2-3 days.

Roger's patch is bug11965 in my public repo. Could use a little attention

I was never able to reproduce this issue without pluggable transports. (I thought I was, but I now think I just had a slow bridge).

So I think the patch in bug11965 is a fine patch, but I'm only certain it applies to the case where you're using bridges and also pluggable transports. I think that means bugfix on 0.2.3.6-alpha?

I'm trying to reproduce this issue, but I can't trigger it after the patch to #9229 (moved).

I'm doing the following. I have a simple torrc that connets to an obfs3 bridge. I bootstrap to that bridge once. Then kill Tor. Then try to bootstrap a second time immediately afterwards. I bootstrap normally with the #9229 (moved) patch. Without it, it takes 60s.

SocksPort auto

UseBridges 1
Bridge obfs3 83.212.101.3:80
ClientTransportPlugin scramblesuit,obfs3 exec /usr/local/bin/obfsproxy managed

I'm using tor git master btw (fef65fa64341fb70df0e7b34d91d3b08a74e7aad) no extra patches. cached-descriptors.new was populated with the bridge descriptor as normal.

Replying to asn:

I'm trying to reproduce this issue, but I can't trigger it after the patch to #9229 (moved).

I'm doing the following. I have a simple torrc that connets to an obfs3 bridge. I bootstrap to that bridge once. Then kill Tor. Then try to bootstrap a second time immediately afterwards. I bootstrap normally with the #9229 (moved) patch. Without it, it takes 60s.

{{{ SocksPort auto

UseBridges 1 Bridge obfs3 83.212.101.3:80 ClientTransportPlugin scramblesuit,obfs3 exec /usr/local/bin/obfsproxy managed }}}

I'm using tor git master btw (fef65fa64341fb70df0e7b34d91d3b08a74e7aad) no extra patches. cached-descriptors.new was populated with the bridge descriptor as normal.

Maybe try a different DataDirectory? It also might matter that your Bridge line have a fingerprint.

asn: have your bridge fingerprint (edit: no, not fingerprint, descriptor) in your cached-descriptors file, and then delete your cached-micro* files, and then try.

Replying to dcf:

Replying to asn:

I'm using tor git master btw (fef65fa64341fb70df0e7b34d91d3b08a74e7aad) no extra patches. cached-descriptors.new was populated with the bridge descriptor as normal.

...It also might matter that your Bridge line have a fingerprint.

That's exactly the problem. I was able to reproduce asn's inability to reproduce the bug by using a bridge line without a fingerprint. Once I added the fingerprint, the problem resurfaced. Reproduced both on fef65fa64341fb70df0e7b34d91d3b08a74e7aad and the now-current master (cfd0ee514c279bc6c7b7c299e001693a5aeb1f5f).

Like arma, I was only able to reproduce this with pluggable transports. The nickm/bug11965 branch fixes it for me.

Trac:
Username: alphawolf

added the version to the changelog, and merged this. Calling it Fixed; please reopen if it isn't.

Trac:
Resolution: N/A to fixed
Status: needs_revision to closed

Yes! I reproduced this by adding a fingeprint to my torrc. Without a fingerprint, it was not loading the bridge descriptor from the cache because of ROUTER_WAS_NOT_WANTED.

arma/nick's branch works fine and solves the 60s problem for me!

closed

mentioned in issue #11966 (moved)

mentioned in issue #12020 (moved)

moved to tpo/core/tor#11965 (closed)

mentioned in issue tpo/core/tor#12020 (closed)

60-second pause bootstrapping with a bridge if you already have its descriptor

Child items ...

Activity