Opened 9 years ago

Closed 9 years ago

Last modified 7 years ago

#1772 closed defect (fixed)

Counting circuit timeouts when we can't establish any TLS connection

Reported by: arma Owned by: mikeperry
Priority: High Milestone: Tor: 0.2.2.x-final
Component: Core Tor/Tor Version:
Severity: Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Running 856a36c43439da as a client

When my laptop loses its internet connection, I get this in my log:

Jul 28 22:43:09.572 [info] connection_ap_make_link(): Making internal direct tunnel to 85.17.59.169:19090 ...

Jul 28 22:43:09.572 [debug] onion_pick_cpath_exit(): Launching a one-hop circuit for dir tunnel.
Jul 28 22:43:09.572 [info] onion_pick_cpath_exit(): Using requested exit node 'privacyvpndotnet01'
Jul 28 22:43:09.572 [debug] onion_extend_cpath(): Path is 0 long; we want 1
Jul 28 22:43:09.572 [debug] onion_extend_cpath(): Chose router privacyvpndotnet01 for hop 1 (exit is privacyvpndotnet01)
Jul 28 22:43:09.572 [debug] onion_extend_cpath(): Path is complete: 1 steps long
Jul 28 22:43:09.572 [debug] circuit_handle_first_hop(): Looking for firsthop '85.17.59.169:19090'
Jul 28 22:43:09.572 [info] circuit_handle_first_hop(): Next router is privacyvpndotnet01: Not connected. Connecting.
Jul 28 22:43:09.572 [debug] connection_connect(): Connecting to "85.17.59.169":19090.
Jul 28 22:43:09.572 [debug] connection_connect(): Connection to "85.17.59.169":19090 in progress (sock 13).
Jul 28 22:43:09.572 [debug] connection_add(): new conn type OR, socket 13, address 85.17.59.169, n_conns 16.

Jul 28 22:43:20.316 [debug] connection_or_finished_connecting(): OR connect() to router at 85.17.59.169:19090 finished.
Jul 28 22:43:20.316 [debug] connection_tls_start_handshake(): starting TLS handshake on fd 13

Jul 28 22:43:41.629 [debug] onion_pick_cpath_exit(): Launching a one-hop circuit for dir tunnel.
Jul 28 22:43:41.629 [info] onion_pick_cpath_exit(): Using requested exit node 'privacyvpndotnet01'
Jul 28 22:43:41.630 [debug] onion_extend_cpath(): Path is 0 long; we want 1
Jul 28 22:43:41.630 [debug] onion_extend_cpath(): Chose router privacyvpndotnet01 for hop 1 (exit is privacyvpndotnet01)
Jul 28 22:43:41.630 [debug] onion_extend_cpath(): Path is complete: 1 steps long
Jul 28 22:43:41.630 [debug] circuit_handle_first_hop(): Looking for firsthop '85.17.59.169:19090'
Jul 28 22:43:41.630 [info] circuit_handle_first_hop(): Next router is privacyvpndotnet01: Connection in progress; waiting.
Jul 28 22:43:41.630 [debug] circuit_handle_first_hop(): connecting in progress (or finished). Good.
Jul 28 22:43:41.630 [debug] conn_write_callback(): socket 17 wants to write.
Jul 28 22:43:42.633 [debug] circuit_build_times_add_time(): Adding circuit build time 2147483646
Jul 28 22:43:42.635 [info] circuit_build_times_get_xm(): Xm mode #0: 925 23
Jul 28 22:43:42.635 [info] circuit_build_times_get_xm(): Xm mode #1: 1425 15
Jul 28 22:43:42.635 [info] circuit_build_times_get_xm(): Xm mode #2: 1025 13
Jul 28 22:43:42.635 [notice] Based on 1000 circuit times, it looks like we need to wait longer for circuits to finish. We will now assume a circuit is too slow to use after waiting 64 seconds.
Jul 28 22:43:42.635 [info] circuit_build_times_set_timeout(): Circuit timeout data: 64381.435070ms, 2148375.414446ms, Xm: 1097, a: 0.395221, r: 0.164000
Jul 28 22:43:42.636 [info] circuit_expire_building(): Abandoning circ 0 (state 2:connecting to server, purpose 13)
Jul 28 22:43:42.636 [info] exit circ (length 1, exit privacyvpndotnet01): $58F9C92572EA8BF85F38C87467CA1542AD8D6F65(closed)
Jul 28 22:43:42.636 [info] circuit_build_failed(): Our circuit died before the first hop with no connection
Jul 28 22:43:42.636 [info] connection_ap_fail_onehop(): Closing one-hop stream to '$58F9C92572EA8BF85F38C87467CA1542AD8D6F65/85.17.59.169' because the OR conn just failed.

Jul 28 22:43:42.641 [info] connection_dir_client_reached_eof(): 'fetch' response not all here, but we're at eof. Closing.
Jul 28 22:43:42.642 [debug] conn_close_if_marked(): Cleaning up connection (fd -1).
Jul 28 22:43:42.642 [info] connection_dir_request_failed(): Giving up on directory server at '85.17.59.169'; retrying

Jul 28 22:43:56.474 [debug] connection_tls_finish_handshake(): tls handshake with 85.17.59.169 done. verifying.
Jul 28 22:43:56.475 [debug] connection_or_check_valid_tls_handshake(): The certificate seems to be valid on outgoing connection with 85.17.59.169:19090

Jul 28 22:44:14.749 [debug] circuit_build_times_add_time(): Adding circuit build time 2147483646
Jul 28 22:44:14.751 [info] circuit_build_times_get_xm(): Xm mode #0: 925 23
Jul 28 22:44:14.751 [info] circuit_build_times_get_xm(): Xm mode #1: 1425 15
Jul 28 22:44:14.751 [info] circuit_build_times_get_xm(): Xm mode #2: 1025 13
Jul 28 22:44:14.751 [notice] Based on 1000 circuit times, it looks like we need to wait longer for circuits to finish. We will now assume a circuit is too slow to use after waiting 66 seconds.
Jul 28 22:44:14.751 [info] circuit_build_times_set_timeout(): Circuit timeout data: 65576.560832ms, 2223200.500411ms, Xm: 1097, a: 0.393444, r: 0.165000
Jul 28 22:44:14.751 [info] circuit_expire_building(): Abandoning circ 85.17.59.169:19090:46080 (state 0:doing handshakes, purpose 13)
Jul 28 22:44:14.752 [info] exit circ (length 1, exit privacyvpndotnet01): $58F9C92572EA8BF85F38C87467CA1542AD8D6F65(waiting for keys)
Jul 28 22:44:14.752 [info] circuit_build_failed(): Our circuit failed to get a response from the first hop (85.17.59.169:19090). I'm going to try to rotate to a better connection.
Jul 28 22:44:14.752 [debug] connection_or_send_destroy(): Sending destroy (circID 46080).

A) Why are we counting cbt timeouts for one-hop circs? I guess in circuit_expire_building() we do in fact compute begindir_cutoff as a function of circ_times.timeout_ms. Ok.

B) If the first hop fails to establish, we shouldn't necessarily be blaming our circuit build timeout. It could be that we don't have a network. But it could also be that our cbt is too small. Hm.

C) What's up with this "expire it twice" business? Are we marking the circuit to state MEASURE the first time, then when the TLS finally finishes, we're killing it again and calling circuit_build_times_count_close()?

I get lots of these TLS conn failures in quick succession when my network goes away, and CBT is punishing me for it.

Child Tickets

Change History (17)

comment:1 Changed 9 years ago by arma

Milestone: Tor: 0.2.2.x-final

comment:2 Changed 9 years ago by arma

To be fair, it looks like in this case I *did* finish the tls conn. It just took 47 seconds. But my begindir CBT kicked in at 65/2=32 seconds in to the request -- before the TLS conn had finished. Weird.

Speaking of which, we moved the timeout distribution to a three-humped distribution, to account for the three guard relays. But how do we account for the wide variety of connections we make for one-hop dir requests?

comment:3 Changed 9 years ago by nickm

Owner: set to mikeperry
Status: newassigned

Assigning to mikeperry, our resident cbt expert.

re A: Maybe we *should* only consider 3-hop (or longer) circuits when looking into timeouts?

comment:4 Changed 9 years ago by mikeperry

arma -

A). You're right. I think I am going to revert my fix for #1741 to explicitly check for 3 hop paths and only count those.

B). The logic to determine if the network is dead is present in circuit_build_times_network_close() and circuit_build_times_network_is_live(). cbt_network_is_live() is called every time we get a cell, and resets the internal counter cbt->liveness.nonlive_timeouts to 0. Every time you get to the 95th percentile on the CBT curve, circuit_build_times_count_close() is called. It calls the cbt_network_close(), which increments cbt->liveness.nonlive_timeouts. circuit_build_times_network_check_live() is the call that checks this counter, and if it gets above 3, we stop counting timeouts. If it gets above 6, we throw away the 3 previous timeouts (which were recorded, but presumably happened when the network was believed to be down).

If there is a problem in how we decide to stop counting circuits when your network goes away, it's in that pile of madness. Please have a look at it, and/or grep your logs for "Network is no longer live" and "Network is flaky" and other associated logs from those functions.

C). Yes, that's the idea. The purpose should first be non-13 in the first call to expire. It gets switched to 13, and then is allowed to build until the timeout value for the 95th percentile on the CBT curve, at which point circuit_expire_building decides to really kill it.

comment:5 Changed 9 years ago by mikeperry

Sebastian got this possibly related warn:

04:41:14.964 [warn] circuit_build_times_network_close(): Bug: Circuit somehow completed a hop while the network was not live. Network was last live at 2010-08-20 04:40:13, but circuit launched at 2010-08-20 04:40:13. It's now 2010-08-20 04:41:14.

Aug 20 04:48:21.964 [warn] circuit_build_times_network_close(): Bug: Circuit somehow completed a hop while the network was not live. Network was last live at 2010-08-20 04:47:20, but circuit launched at 2010-08-20 04:47:20. It's now 2010-08-20 04:48:21.

Although his network connection was live at the time.

comment:6 Changed 9 years ago by mikeperry

14:31 < ioerror> Aug 23 23:17:42.590 [warn]

circuit_build_times_network_close(): Bug: Circuit somehow
completed a hop while the network was not live. Network was
last live at 2010-08-23 23:16:41, but circuit launched at
2010-08-23 23:16:41. It's now 2010-08-23 23:17:42.

It is interesting that all of these are *exactly* 1 minute and 1 second after circuit launch. Perhaps the logic is only wrong for the timeout codepath(s) for these particular warns.

comment:7 Changed 9 years ago by arma

Priority: normalmajor

Triage: this bug is blocking 0.2.2.x-rc.

comment:8 Changed 9 years ago by arma

Ah ha. Mystery partially solved. Here's what I was seeing:

Sep 29 08:40:20.807 [debug] onion_pick_cpath_exit(): Launching a one-hop circuit
 for dir tunnel.
Sep 29 08:40:20.808 [info] onion_pick_cpath_exit(): Using requested exit node 'P
andora32'
Sep 29 08:40:20.808 [debug] onion_extend_cpath(): Path is 0 long; we want 1
Sep 29 08:40:20.808 [debug] onion_extend_cpath(): Chose router Pandora32 for hop 1 (exit is Pandora32)
Sep 29 08:40:20.808 [debug] onion_extend_cpath(): Path is complete: 1 steps long
Sep 29 08:40:20.808 [debug] circuit_handle_first_hop(): Looking for firsthop '91.208.34.32:443'
Sep 29 08:40:20.808 [info] circuit_handle_first_hop(): Next router is Pandora32: Connection in progress; waiting.
Sep 29 08:40:20.808 [debug] circuit_handle_first_hop(): connecting in progress (or finished). Good.
Sep 29 08:40:21.803 [debug] circuit_expire_building(): 1285764021 60000 30 30 1285763991
Sep 29 08:40:21.803 [debug] circuit_expire_building(): 1285763990 1285763991
Sep 29 08:40:21.803 [debug] circuit_build_times_add_time(): Adding circuit build time 2147483646
Sep 29 08:40:21.803 [info] circuit_expire_building(): Abandoning circ 0 (state 2:connecting to server, purpose 13)
Sep 29 08:40:21.803 [info] exit circ (length 1, exit Pandora32): Pandora32(closed)
Sep 29 08:40:21.803 [info] circuit_build_failed(): Our circuit died before the first hop with no connection
Sep 29 08:40:21.803 [info] connection_ap_fail_onehop(): Closing one-hop stream to '$DFE540DC6BE5C822223D6A2347DC9C04E75E4509/91.208.34.32' because the OR conn just failed.

And the question is "why the heck is it abandoning the circuit one second after launching it".

The first answer is that it's not, it's abandoning a circuit 30 seconds after launching it:

Sep 29 08:39:50.688 [info] update_consensus_networkstatus_downloads(): Launching ns networkstatus consensus download.
Sep 29 08:39:50.690 [debug] smartlist_choose_by_bandwidth_weights(): Choosing node for rule weight as directory based on weights Wg=0.000000 Wm=1.000000 We=0.029000 Wd=0.003900 with total bw 567996336.300000
Sep 29 08:39:50.690 [debug] directory_initiate_command_rend(): anonymized 0, use_begindir 1.
Sep 29 08:39:50.690 [debug] directory_initiate_command_rend(): Initiating consensus network-status fetch
Sep 29 08:39:50.690 [info] connection_ap_make_link(): Making internal direct tunnel to 91.208.34.32:443 ...
Sep 29 08:39:50.690 [debug] connection_add_impl(): new conn type Socks, socket -1, address (Tor_internal), n_conns 6.
Sep 29 08:39:50.690 [debug] onion_pick_cpath_exit(): Launching a one-hop circuit for dir tunnel.
Sep 29 08:39:50.690 [info] onion_pick_cpath_exit(): Using requested exit node 'Pandora32'
Sep 29 08:39:50.690 [debug] onion_extend_cpath(): Path is 0 long; we want 1
Sep 29 08:39:50.690 [debug] onion_extend_cpath(): Chose router Pandora32 for hop 1 (exit is Pandora32)
Sep 29 08:39:50.691 [debug] onion_extend_cpath(): Path is complete: 1 steps long
Sep 29 08:39:50.691 [debug] circuit_handle_first_hop(): Looking for firsthop '91.208.34.32:443'
Sep 29 08:39:50.691 [info] circuit_handle_first_hop(): Next router is Pandora32: Not connected. Connecting. 
Sep 29 08:39:50.691 [debug] connection_connect(): Connecting to "91.208.34.32":443.
Sep 29 08:39:50.691 [debug] connection_connect(): Connection to "91.208.34.32":443 in progress (sock 13).
Sep 29 08:39:50.691 [debug] connection_add_impl(): new conn type OR, socket 13, address 91.208.34.32, n_conns 7.
Sep 29 08:39:50.691 [debug] circuit_handle_first_hop(): connecting in progress (or finished). Good.

So why is the new circuit launched at 08:40:20.807? On the first go-round through circuit_expire_building() we notice that its purpose is C_GENERAL, and switch it:

      if (victim->purpose != CIRCUIT_PURPOSE_C_MEASURE_TIMEOUT) {
        victim->purpose = CIRCUIT_PURPOSE_C_MEASURE_TIMEOUT;
        circuit_build_times_count_timeout(&circ_times,
                                          first_hop_succeeded);
        continue;
      }

Then we notice that there's an outstanding directory fetch request, with no circuit to handle it, so we launch one. Then one second later, we're back in circuit_expire_building() and it's time to kill the circuit because it's purpose CIRCUIT_PURPOSE_C_MEASURE_TIMEOUT.

(This behavior only happens when I've got 100 or fewer circuit build times, since it's only then that circ_times.timeout_ms equals circ_times.close_ms.)

Sure enough, 30 seconds later,

Sep 29 08:40:51.859 [debug] circuit_build_times_add_time(): Adding circuit build time 2147483646
Sep 29 08:40:51.859 [info] circuit_expire_building(): Abandoning circ 0 (state 2:connecting to server, purpose 13)
Sep 29 08:40:51.859 [info] exit circ (length 1, exit Pandora32): Pandora32(closed)
Sep 29 08:40:51.859 [info] circuit_build_failed(): Our circuit died before the first hop with no connection

comment:9 Changed 9 years ago by arma

http://freehaven.net/~arma/tord-bug1772 is the beginning of a Tor debug log. When I wake up (and it's been running for hours) I'll post the rest of it.

Interesting points:

A) Note the frequent timeouts caused by the Pandoras. They're mostly one-hop circuits, so in theory they'll go away with the recent patches to #1740. I worry though that we'll just mask the problem then.

B) We should not do the CIRCUIT_PURPOSE_C_MEASURE_TIMEOUT dance if cbt->close_ms == cbt->timeout_ms. It just causes more timeouts.

C) There are a few instances of bug #1980 for you, complete with debug logs:

Sep 29 09:51:48.575 [warn] circuit_build_times_network_close(): Bug: Circuit somehow completed a hop while the network was not live. Network was last live at 2010-09-29 09:50:47, but circuit launched at 2010-09-29 09:50:47. It's now 2010-09-29 09:51:48.

D) I expect that when it reaches 100 build times, it will produce an atrocious cbt. We should think about how to cap the number at something realistic. I am guessing that these Pandoras are simply never planning to answer me, no matter my timeout.

comment:10 in reply to:  9 Changed 9 years ago by arma

Replying to arma:

D) I expect that when it reaches 100 build times, it will produce an atrocious cbt. We should think about how to cap the number at something realistic. I am guessing that these Pandoras are simply never planning to answer me, no matter my timeout.

How about capping it at twice the maximum circuit build time ever witnessed? That should let it grow plenty high while still preventing it from reaching the "2 million seconds" that I got yesterday.

More generally, I fear the algorithm for computing the right timeout behaves poorly in the face of even a small fraction of failing relays.

comment:11 Changed 9 years ago by mikeperry

Status: assignedneeds_review

I decided to simplify all this code. We now consider the network down if we see no cells for the duration of one full circuit measurement period. Fixes in mikeperry/bug1772.

Re your B: If the close_ms == timeout_ms, it shouldn't cause any *extra* timeouts, right? Either way, the circuit would get closed and a new one would have to be launched.. Also, gcc doesn't like == or != comparisons for floating point values, and it will warn if I do that...

Re D: The cap is a good idea. In that branch, I implemented the timeout_ms cap to be the largest build time we have observed, and the close_ms cap to be twice that.

comment:12 Changed 9 years ago by arma

Just to finish the loop, I killed that Tor and uploaded the final debug log.

Also, here are the buildtimes from my state file:

LastWritten 2010-09-29 19:39:11
TotalBuildTimes 103
CircuitBuildAbandonedCount 51
CircuitBuildTimeBin 475 1
CircuitBuildTimeBin 525 2
CircuitBuildTimeBin 575 1
CircuitBuildTimeBin 625 2
CircuitBuildTimeBin 675 1
CircuitBuildTimeBin 725 2
CircuitBuildTimeBin 775 1
CircuitBuildTimeBin 825 3
CircuitBuildTimeBin 875 3
CircuitBuildTimeBin 925 3
CircuitBuildTimeBin 1075 3
CircuitBuildTimeBin 1125 1
CircuitBuildTimeBin 1225 1
CircuitBuildTimeBin 1375 1
CircuitBuildTimeBin 1425 1
CircuitBuildTimeBin 1475 1
CircuitBuildTimeBin 1625 2
CircuitBuildTimeBin 1925 1
CircuitBuildTimeBin 2075 1
CircuitBuildTimeBin 2575 1
CircuitBuildTimeBin 2775 1
CircuitBuildTimeBin 2875 1
CircuitBuildTimeBin 3075 1
CircuitBuildTimeBin 3875 1
CircuitBuildTimeBin 4775 1
CircuitBuildTimeBin 5225 1
CircuitBuildTimeBin 5975 1
CircuitBuildTimeBin 7025 1
CircuitBuildTimeBin 7625 1
CircuitBuildTimeBin 7675 1
CircuitBuildTimeBin 8675 1
CircuitBuildTimeBin 9325 1
CircuitBuildTimeBin 32075 1
CircuitBuildTimeBin 34875 1
CircuitBuildTimeBin 36325 1
CircuitBuildTimeBin 40975 1
CircuitBuildTimeBin 41875 1
CircuitBuildTimeBin 46425 1
CircuitBuildTimeBin 46825 1
CircuitBuildTimeBin 48625 1

comment:13 in reply to:  11 Changed 9 years ago by arma

Replying to mikeperry:

Re your B: If the close_ms == timeout_ms, it shouldn't cause any *extra* timeouts, right? Either way, the circuit would get closed and a new one would have to be launched.. Also, gcc doesn't like == or != comparisons for floating point values, and it will warn if I do that...

By close_ms == timeout_ms, I meant when

cbt->total_build_times < circuit_build_times_min_circs_to_observe()

In this case, the timeout_ms and close_ms are set equal to each other, so there's no point switching the circuit over to purpose CIRCUIT_PURPOSE_C_MEASURE_TIMEOUT.

Also, there is a difference, since if we fail before the first hop, we blame the relay, kill the connection attempt, and kill all streams asking for this relay via connection_ap_fail_onehop(). Also, in the above example, we're launching a second circuit to that relay right before declaring the relay dead and giving up on the directory fetch that was destined for it -- and since the relay is actually the one at fault, the next circuit fails too and we count a second circuit timeout at 2147483646. It seems that if we know we're going to do that one second later, we should do it now, and save ourselves the trouble of wasting a circuit.

See commit ceb3d4d578f4ebb8d in arma/bug1772 for my fix.

A different approach would be for circuit_build_times_init() to set timeout_ms to circuit_build_times_get_initial_timeout(), and set close_ms to 2*that. I have a mild preference against that approach though, since it just complicates things.

comment:14 Changed 9 years ago by mikeperry

Updated mikeperry/bug1772 with comments for the liveness stuff. I did not see your casting patch in your branch, so you need to make sure that gets in still, somehow.

comment:15 Changed 9 years ago by arma

Merged.

comment:16 Changed 9 years ago by arma

Resolution: fixed
Status: needs_reviewclosed

I'm going to call this one fixed. Whatever comes up next should get a new bug number.

comment:17 Changed 7 years ago by nickm

Component: Tor ClientTor
Note: See TracTickets for help on using tickets.