Opened 6 years ago

Closed 6 years ago

#9946 closed defect (fixed)

Bootstrapping Tor client hangs up on its chosen guards

Reported by: arma
Owned by:
Priority: Medium
Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-client
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Here are some log snippets from a freshly bootstrapping Tor client:

Sep 27 03:24:11.369 [info] add_an_entry_guard(): Chose $3DD2523F1B241F01D54818F327714CDA7F54238B~wannabe2 at 144.76.110.103 as new entry guard.
Sep 27 03:24:11.369 [info] log_entry_guards(): wannabe2 [3DD2523F1B241F01D54818F327714CDA7F54238B] (up never-contacted)
Sep 27 03:24:11.371 [info] add_an_entry_guard(): Chose $821FAFA24651419FAE76A5C80BA79EA7E0743C4D=servbr3 at 212.83.158.171 as new entry guard.
Sep 27 03:24:11.371 [info] log_entry_guards(): wannabe2 [3DD2523F1B241F01D54818F327714CDA7F54238B] (up never-contacted),servbr3 [821FAFA24651419FAE76A5C80BA79EA7E0743C4D] (up never-contacted)
Sep 27 03:24:11.373 [info] add_an_entry_guard(): Chose $CFB4CB595C2504444E8F8F76CAC42F0C97200734=SECxFreeBSD64 at 195.169.125.226 as new entry guard.
Sep 27 03:24:11.373 [info] log_entry_guards(): wannabe2 [3DD2523F1B241F01D54818F327714CDA7F54238B] (up never-contacted),servbr3 [821FAFA24651419FAE76A5C80BA79EA7E0743C4D] (up never-contacted),SECxFreeBSD64 [CFB4CB595C2504444E8F8F76CAC42F0C97200734] (up never-contacted)

So now I have my three guards, and they're all considered up (the consensus said so) and never-contacted (that's how they start out).
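
(For reference, the per-guard state behind "up" and "never-contacted" is kept in Tor's entry_guard_t. The field names below follow the real struct, but this is an abridged paraphrase for illustration, not the literal definition from entrynodes.h:)

#include <time.h>

/* Abridged paraphrase of Tor's per-guard bookkeeping, for illustration only. */
typedef struct entry_guard_t {
  char nickname[20];            /* e.g. "wannabe2" (array size here is illustrative) */
  char identity[20];            /* identity digest of the guard */
  unsigned made_contact : 1;    /* have we ever successfully connected to it? */
  time_t unreachable_since;     /* 0, or when we started failing to reach it */
  time_t bad_since;             /* 0, or when the consensus stopped listing it as
                                 * suitable; "up" in the log means this is 0 */
} entry_guard_t;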

Sep 27 03:24:11.362 [info] launch_descriptor_downloads(): Launching 45 requests for 4084 microdescs, 92 at a time

Then choose_random_entry_impl() runs this code:

      if (!entry->made_contact) {
        /* Always start with the first not-yet-contacted entry
         * guard. Otherwise we might add several new ones, pick
         * the second new one, and now we've expanded our entry
         * guard list without needing to. */
        goto choose_and_finish;
      }

so all 45 of those microdescriptor requests decide to ask wannabe2, and the client queues all 45 directory attempts on it.

So it connects to wannabe2:

Sep 27 03:24:11.373 [debug] directory_initiate_command_rend(): Initiating microdescriptor fetch
Sep 27 03:24:11.373 [info] connection_ap_make_link(): Making internal direct tunnel to 144.76.110.103:443 ...
[...]
Sep 27 03:24:11.373 [debug] circuit_handle_first_hop(): Looking for firsthop '144.76.110.103:443'
Sep 27 03:24:11.373 [info] circuit_handle_first_hop(): Next router is $3DD2523F1B241F01D54818F327714CDA7F54238B~3DD2523F1B241F01D54 at 144.76.110.103: Not connected. Connecting.
[...]
Sep 27 03:24:11.373 [debug] connection_connect(): Connecting to "144.76.110.103":443.
Sep 27 03:24:11.373 [debug] connection_connect(): Connection to "144.76.110.103":443 in progress (sock 13).
Sep 27 03:24:11.373 [debug] connection_add_impl(): new conn type OR, socket 13, address 144.76.110.103, n_conns 7.
[...]
Sep 27 03:24:11.373 [debug] circuit_handle_first_hop(): connecting in progress (or finished). Good.
[...]
Sep 27 03:24:11.484 [debug] connection_or_finished_connecting(): OR connect() to router at 144.76.110.103:443 finished.
Sep 27 03:24:11.485 [debug] connection_tls_start_handshake(): starting TLS handshake on fd 13
[...]
Sep 27 03:24:11.937 [info] channel_tls_process_versions_cell(): Negotiated version 4 with 144.76.110.103:443; Waiting for CERTS cell
Sep 27 03:24:11.937 [debug] connection_or_process_cells_from_inbuf(): 13: starting, inbuf_datalen 1473 (0 pending in tls object).
Sep 27 03:24:11.938 [info] channel_tls_process_certs_cell(): Got some good certificates from 144.76.110.103:443: Authenticated it.
[...]
Sep 27 03:24:11.938 [debug] circuit_n_chan_done(): chan to NULL/144.76.110.103:443, status=1
Sep 27 03:24:11.938 [debug] circuit_n_chan_done(): Found circ, sending create cell.
[...]
Sep 27 03:24:11.938 [info] circuit_send_next_onion_skin(): First hop: finished sending CREATE_FAST cell to '$3DD2523F1B241F01D54818F327714CDA7F54238B~wannabe2 at 144.76.110.103'
Sep 27 03:24:11.938 [info] channel_tls_process_netinfo_cell(): Got good NETINFO cell from 144.76.110.103:443; OR connection is now open, using protocol version 4. Its ID digest is 3DD2523F1B241F01D54818F327714CDA7F54238B. Our address is apparently 76.99.61.6.

But wait, what's that? We send our CREATE_FAST cells before we get our NETINFO cell? Not sure if that is a bug.

In any case, time passes, and we give up on all those circuits:

Sep 27 03:25:12.150 [notice] No circuits are opened. Relaxed timeout for circuit 3 (a General-purpose client 1-hop circuit in state doing handshakes with channel state open) to 60000ms. However, it appears the circuit has timed out anyway. 0 guards are live.
Sep 27 03:25:12.150 [info] circuit_expire_building(): Abandoning circ 3 144.76.110.103:443:-2147451209 (state 0,0:doing handshakes, purpose 5, len 1)
Sep 27 03:25:12.150 [info] internal circ (length 1, last hop 3DD2523F1B241F01D54818F327714CDA7F54238B): $3DD2523F1B241F01D54818F327714CDA7F54238B(waiting for keys)
[...]
Sep 27 03:25:12.150 [info] circuit_build_failed(): Our circuit failed to get a response from the first hop (144.76.110.103:443). I'm going to try to rotate to a better connection.
Sep 27 03:25:12.150 [info] entry_guard_register_connect_status(): Unable to connect to entry guard 'wannabe2' (3DD2523F1B241F01D54818F327714CDA7F54238B). Marking as unreachable.

and then we give up on the 45 directory requests:

Sep 27 03:25:12.150 [info] connection_ap_fail_onehop(): Closing one-hop stream to '$3DD2523F1B241F01D54818F327714CDA7F54238B/144.76.110.103' because the OR conn just failed.
Sep 27 03:25:12.150 [info] connection_ap_fail_onehop(): Closing one-hop stream to '$3DD2523F1B241F01D54818F327714CDA7F54238B/144.76.110.103' because the OR conn just failed.
[and 43 more]

So let's try again! We marked our first guard down so now we'll ask our second guard:

Sep 27 03:25:14.151 [debug] launch_descriptor_downloads(): There are enough downloadable microdescs to launch requests.
Sep 27 03:25:14.151 [info] launch_descriptor_downloads(): Launching 45 requests for 4084 microdescs, 92 at a time
Sep 27 03:25:14.159 [info] onion_pick_cpath_exit(): Using requested exit node '$821FAFA24651419FAE76A5C80BA79EA7E0743C4D~821FAFA24651419FAE7 at 212.83.158.171'
[...]
Sep 27 03:25:14.160 [info] circuit_handle_first_hop(): Next router is $821FAFA24651419FAE76A5C80BA79EA7E0743C4D~821FAFA24651419FAE7 at 212.83.158.171: Not connected. Connecting.
[...]
Sep 27 03:25:14.160 [debug] connection_connect(): Connecting to "212.83.158.171":443.
Sep 27 03:25:14.160 [debug] connection_connect(): Connection to "212.83.158.171":443 in progress (sock 13).
Sep 27 03:25:14.160 [debug] connection_add_impl(): new conn type OR, socket 13, address 212.83.158.171, n_conns 6.
[...]
Sep 27 03:25:14.261 [debug] connection_or_finished_connecting(): OR connect() to router at 212.83.158.171:443 finished.
[...]
Sep 27 03:25:14.683 [info] channel_tls_process_versions_cell(): Negotiated version 4 with 212.83.158.171:443; Waiting for CERTS cell
[...]
Sep 27 03:25:14.683 [info] channel_tls_process_certs_cell(): Got some good certificates from 212.83.158.171:443: Authenticated it.

And here's where it gets good:

Sep 27 03:25:14.683 [info] entry_guard_register_connect_status(): Connected to new entry guard 'servbr3' (821FAFA24651419FAE76A5C80BA79EA7E0743C4D). Marking earlier entry guards up. 0/3 entry guards usable/new.
Sep 27 03:25:14.683 [info] log_entry_guards(): wannabe2 [3DD2523F1B241F01D54818F327714CDA7F54238B] (up made-contact),servbr3 [821FAFA24651419FAE76A5C80BA79EA7E0743C4D] (up made-contact),SECxFreeBSD64 [CFB4CB595C2504444E8F8F76CAC42F0C97200734] (up never-contacted)
Sep 27 03:25:14.683 [debug] channel_do_open_actions(): New entry guard was reachable, but closing this connection so we can retry the earlier entry guards.
Sep 27 03:25:14.683 [debug] circuit_n_chan_done(): chan to NULL/212.83.158.171:443, status=0
Sep 27 03:25:14.683 [info] circuit_n_chan_done(): Channel failed; closing circ.
Sep 27 03:25:14.683 [info] circuit_build_failed(): Our circuit died before the first hop with no connection
Sep 27 03:25:14.683 [info] entry_guard_register_connect_status(): Unable to connect to entry guard 'servbr3' (821FAFA24651419FAE76A5C80BA79EA7E0743C4D). Marking as unreachable.

And then we close all the directory requests that we'd queued for it because we closed the connection:

Sep 27 03:25:14.683 [info] connection_ap_fail_onehop(): Closing one-hop stream to '$821FAFA24651419FAE76A5C80BA79EA7E0743C4D/212.83.158.171' because the OR conn just failed.
[and 44 more]

And then my client launches 45 new directory requests, which spread over the three available guards, and things bootstrap ok.

It's the "New entry guard was reachable, but closing this connection so we can retry the earlier entry guards." behavior that is the bug here. I think the fix might be to mark newly selected guards as already-contacted when you first pick them at the beginning of time.

Child Tickets

Change History (13)

comment:1 Changed 6 years ago by arma

I think we should fix this by adding another flag to add_an_entry_guard(), where if we're adding guards because we don't have enough, we set it to made-contact, whereas if we're adding guards because all of ours are down, we set it to never-contacted.
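
(A toy sketch of that idea; the flag name and helper below are illustrative, not the actual patch:)

/* Toy sketch, not the actual patch: pre-mark guards that we add merely
 * because our list is too short, so their first successful connection
 * isn't treated as "a new guard came up; close it and retry the old ones". */
typedef struct {
  int made_contact;
} toy_guard_t;

void
add_guard(toy_guard_t *g, int adding_because_list_too_short)
{
  /* Fresh bootstrap / topping up the list: behave as if we'd already
   * contacted this guard.  Adding because all our guards look down:
   * leave it never-contacted so the normal retry logic still applies. */
  g->made_contact = adding_because_list_too_short ? 1 : 0;
}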

comment:2 Changed 6 years ago by arma

(We'll actually be spreading our directory fetches over all our directory guards at bootstrap this way. So it is probably less secure, since talking to more than one guard is bad. But a) we quite quickly start to use the other guards anyway, so it's not like we actually keep that security benefit, and b) if we want to use fewer guards, we should do that in #9273.)

comment:3 Changed 6 years ago by arma

Nick doesn't like overloading the meaning of an existing flag whose name makes it sound like you can guess what it means.

New suggestion: add a new flag, exploratory, which is set when we would rather use our previous guards.

comment:4 Changed 6 years ago by nickm

I'm fine with that idea (suggestion: is_exploratory), I think. Can you be explicit about when we set it, when we clear it, how it persists, and what effects it has?

Are you really thinking this is simple enough for 0.2.4?

comment:5 Changed 6 years ago by arma

Well, the reason I picked 0.2.4 is because this is a regression introduced in 0.2.4 (because of the directory guard thing).

comment:6 Changed 6 years ago by nickm

Is there a way to do this simply in 0.2.4, and right in 0.2.5?

comment:7 Changed 6 years ago by arma

Status: new → needs_review

See my bug9946 branch. It does it the simpler way in 0.2.4, by overloading the made_contact field as I originally suggested.

I'm beginning to think the better fix after this (for 0.2.4 or 0.2.5, I don't think I care) is to rename made_contact to something like for_discovery, and be done with it.

comment:8 Changed 6 years ago by nickm

Looks okay to me. How did you test this?

As for a better name for the field, "for_discovery" doesn't make sense to me. Words like "provisional" or "probationary" would make more sense. That's fine as a separate patch for 0.2.5.

(This function is approaching the critical threshold of boolean arguments where all code using it will be unreadable without care. Not something to fix in 0.2.4 though.)

comment:9 Changed 6 years ago by arma

I tested it by bootstrapping a Tor client a few times. Seems to work better than before.

(I agree about the function arguments.)

comment:10 Changed 6 years ago by nickm

Okay. Did you specifically observe that the failure mode in the description of this bug is not happening any more with this patch?

comment:11 Changed 6 years ago by arma

Yes -- all the guards are marked made-contact, so we don't abandon them when we connect to them.

That said, there are still ways to improve the situation here.

A) If one of our three guards is a bust, it will take 120 seconds for requests to it to time out, and even if the other two guards are working great, that typically isn't enough progress to reach 80% bootstrapped. So there will remain cases where we wait two minutes to bootstrap -- and maybe this patch even increases the chance of those, since any of the three guards can cause them.

B) Our choice of how many microdescs to fetch per directory request is really totally irrational since the network has grown so much since we chose those parameters.

Oct 11 13:32:44.779 [info] launch_descriptor_downloads(): Launching 50 requests for 4574 microdescs, 92 at a time

Seriously, we're making 50 requests and spreading them over 3 guards? I guess it's nice that we can get partial answers, but generally partial answers don't help us.
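
(A quick sanity check on the arithmetic: 4574 missing microdescs at 92 per request is ceil(4574/92) = 50 requests, or roughly 17 requests per guard when spread over 3 guards; the earlier figure of 45 requests for 4084 microdescs is the same calculation.)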

comment:12 Changed 6 years ago by nickm

Those two sound like other bugs. Shall I merge this one?

comment:13 Changed 6 years ago by arma

Resolution: fixed
Status: needs_review → closed

I opened #9968 and #9969 for the two other bugs. And I opened #9971 for the 0.2.5 cleanup for this bug. And I merged it. Closing. Thanks!
