Opened 8 years ago

Closed 7 years ago

#3443 closed defect (fixed)

Client with low CBT can't establish any circuits

Reported by: arma Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: regression tor-client MikePerry201211
Cc: mikeperry Actual Points: 9
Parent ID: Points:
Reviewer: Sponsor:

Description

I ran my Tor client on my laptop for a while in a stable place. It ended up thinking my connection is super-fast:

Jun 21 01:49:15.786 [info] circuit_build_times_parse_state(): Loaded 1000/1000 values from 136 lines in circuit time histogram
Jun 21 01:49:15.786 [info] circuit_build_times_get_xm(): Xm mode #0: 575 55
Jun 21 01:49:15.786 [info] circuit_build_times_get_xm(): Xm mode #1: 625 51
Jun 21 01:49:15.786 [info] circuit_build_times_get_xm(): Xm mode #2: 575 55
Jun 21 01:49:15.786 [notice] Based on 1000 circuit times, it looks like we don't need to wait so long for circuits to finish. We will now assume a circuit is too slow to use after waiting 2 seconds.
Jun 21 01:49:15.786 [info] circuit_build_times_set_timeout(): Circuit timeout data: 2025.978516ms, 60000.000000ms, Xm: 590, a: 1.304577, r: 0.199000

But then I restarted my Tor, this time with a bridge that is in .za:

Jun 21 01:49:16.968 [debug] onion_pick_cpath_exit(): Launching a one-hop circuit for dir tunnel.
...
Jun 21 01:49:16.968 [notice] Bootstrapped 5%: Connecting to directory server.
...
Jun 21 01:49:16.968 [debug] circuit_handle_first_hop(): connecting in progress (or finished). Good.
...
Jun 21 01:49:17.256 [debug] connection_or_finished_connecting(): OR connect() to router at 196.x.x.x:10000 finished.
Jun 21 01:49:17.256 [notice] Bootstrapped 10%: Finishing handshake with directory server.
...
Jun 21 01:49:17.851 [debug] connection_tls_continue_handshake(): Done with initial SSL handshake (client-side). Requesting renegotiation.
...
Jun 21 01:49:18.448 [debug] connection_tls_finish_handshake(): tls handshake with 196.x.x.x done. verifying.
...
Jun 21 01:49:18.741 [info] command_process_versions_cell(): Negotiated version 2 with 196.x.x.x:10000; sending NETINFO.
...
Jun 21 01:49:18.741 [debug] circuit_send_next_onion_skin(): First skin; sending create cell.
Jun 21 01:49:18.741 [notice] Bootstrapped 15%: Establishing an encrypted directory connection.
...
Jun 21 01:49:18.969 [info] circuit_expire_building(): Abandoning circ 196.x.x.x:10000:54967 (state 0:doing handshakes, purpose 5)
Jun 21 01:49:18.969 [info] exit circ (length 1, last hop 0000000000000000000000000000000000000000): $0000000000000000000000000000000000000000(waiting for keys)
Jun 21 01:49:18.969 [info] circuit_build_failed(): Our circuit failed to get a response from the first hop (196.x.x.x:10000). I'm going to try to rotate to a better connection.
Jun 21 01:49:18.969 [info] connection_ap_fail_onehop(): Closing one-hop stream to '$0000000000000000000000000000000000000000/196.x.x.x' because the OR conn just failed.
...
Jun 21 01:49:18.970 [info] connection_dir_request_failed(): Giving up on directory server at '196.x.x.x'; retrying
...
Jun 21 01:49:27.978 [info] should_delay_dir_fetches(): delaying dir fetches (no running bridges known)

And now my Tor client is dead in the water.

Child Tickets

TicketStatusOwnerSummaryComponent
#6743closedDon't set begindir_cutoff so lowCore Tor/Tor

Attachments (2)

2012-04-11-01-tor-fuckup-logs.txt (3.7 KB) - added by rransom 7 years ago.
Graphs - LAN to Dialup.jpg (39.7 KB) - added by ioneyes 7 years ago.

Download all attachments as: .zip

Change History (54)

comment:1 Changed 8 years ago by arma

We should ponder both short-term bandaids and long-term better designs.

One bandaid would be to dump our cbt estimates if we switch from non-bridges to bridges (or maybe vice versa). But it's not clear from the state file whether we were using bridges at the time, so that's not a very easy hack.

Another bandaid would be to give up on our cbt value (set it back to default) and retry our bridges, if our last bridge just got marked as down due to abandoning a circuit. That might actually be a pretty good fix.

Other options?

comment:2 Changed 7 years ago by nickm

Keywords: regression added

comment:3 Changed 7 years ago by rransom

This bug (Tor gives up on fetching directory information entirely if its directory-fetch circuit times out) killed one of my TBBs (without bridges).

comment:4 Changed 7 years ago by mikeperry

Cc: mikeperry added

See path-spec.txt.

The consensus params that govern this are cbtmaxtimeouts and cbtrecentcount. If cbtmaxtimeouts timeouts happen within the past cbtrecentcount circuits, we give up on CBT.

Right now, the values are 18 and 20 respectively, based on statistical analysis of CBT data and the frequency of reset under normal conditions. It takes a while to build 20 circuits, but if that number is lower, we tend to end up giving up on CBT when we shouldn't, based on statistics.

But, I didn't prove any of this, I just ran a bunch of test data and picked values that ended up not resetting too often. If 18/20 is too much, we can try lowering them.

comment:5 Changed 7 years ago by mikeperry

Of course, there could be a regression that breaks all of this, especially wrt bridges. I did not test with bridges. IIRC, they did not exist when I wrote the code.

comment:6 Changed 7 years ago by arma

In the situations we've seen so far, especially in the bridge case, we do not reach 20 circuits (or even 18) before going dead.

In the extreme case of one bridge, we make one circuit, fail it, and fall dead.

I'm increasingly thinking we should turn cbt learning off by default: it produces a slow trickle of users in #tor who have been bitten by the bug (their Tor was fast for a while, then suddenly it won't work and the only fix is flushing their state file). The other option is tuning it better, but to do that correctly we need to run it in all possible environments, and it's increasingly becoming clear that we have only looked at a tiny subset of the environments people have.

comment:7 in reply to:  5 Changed 7 years ago by arma

Replying to mikeperry:

Of course, there could be a regression that breaks all of this, especially wrt bridges. I did not test with bridges. IIRC, they did not exist when I wrote the code.

Bridges went into 0.2.0.3-alpha. CBT went into 0.2.2.2-alpha.

comment:8 Changed 7 years ago by mikeperry

The CBT function is called circuit_build_times_network_check_changed(). Upon reviewing it, the comments tell me that it only counts circuits that complete a first hop against the timeout. The source seems to be consistent with the comments.

If your bridge was so slow the first hop didn't ever complete (which the logs seem to say happened at least once), that could be a problem.

We used to have a lot more cases to handle first hop failures along with general network disconnect, but we decided to simplify it, because the edge cases were a nightmare.

If we want to do a hack to abandon CBT if all guards change, it seems like circuit_build_times_network_check_changed() would be an OK place to add such a check.

comment:9 in reply to:  8 ; Changed 7 years ago by arma

Replying to mikeperry:

If your bridge was so slow the first hop didn't ever complete (which the logs seem to say happened at least once), that could be a problem.

The problem in this ticket got worse when we introduced v3 handshakes -- there's now a lot that goes on before we can finish the first hop.

In fact, here's a scenario I think happens pretty commonly: Tor client on fast connection starts, establishes tls connections to its guards, builds a bunch of circuits over time, and picks a cbt of a few seconds. Then it restarts, uses the same cbt, but doing a v3 handshake takes too long and it gives up.

comment:10 Changed 7 years ago by mikeperry

Hrmm. I know even less about the v3 handshake than everyone else seems to know about CBT. It looks like we added 2 or 3 more round trips to TLS connection setup? I guess that would do it...

We could just hack the timeout code not to start counting time until a TLS connection exists. Would that solve everything? Or do we still want to add in the "abandon the timeout data if all guards change" check to circuit_build_times_network_check_changed()?

comment:11 in reply to:  9 ; Changed 7 years ago by rransom

Replying to arma:

The problem in this ticket got worse when we introduced v3 handshakes -- there's now a lot that goes on before we can finish the first hop.

The problem in this ticket is present in Tor 0.2.2.x, which does not support the v3 link protocol.

comment:12 in reply to:  11 ; Changed 7 years ago by mikeperry

Replying to rransom:

The problem in this ticket is present in Tor 0.2.2.x, which does not support the v3 link protocol.

Out of curiosity, based on what evidence? So far, all I've heard is anecdotal ranting and singular log lines pasted here and there about connection failures... Again, I maintain that CBT should not leave you dead in the water for more than 18 consecutive timeouts under *any* circumstance.

Incidentally, I've been experiencing bugs that make my tor client dead in the water due to all nodes getting marked 'down' during network unavailability since before 0.2.0.x... Are we sure we're even blaming the right component?

Changed 7 years ago by rransom

comment:13 in reply to:  12 ; Changed 7 years ago by rransom

Replying to mikeperry:

Replying to rransom:

The problem in this ticket is present in Tor 0.2.2.x, which does not support the v3 link protocol.

Out of curiosity, based on what evidence?

See attached.

comment:14 in reply to:  13 ; Changed 7 years ago by mikeperry

Replying to rransom:

See attached.

Hrmm. Another case of a single timeout (or maybe two)... Not very useful for diagnosing anything.

Also looks like it was for a one hop dircon circuit, which I believe we recently decided to stop applying CBT timeouts to.

Can we please try to do some real analysis on this one? I feel like we're all just yelling at each other without bothering to think anything through...

Changed 7 years ago by ioneyes

Attachment: Graphs - LAN to Dialup.jpg added

comment:15 Changed 7 years ago by ioneyes

Tor is capable of tracking quite large ( a ratio I have seen of up to 1:10 ) differences in response time from varying networks.

To test this I ran Tor and switched in a slower network (Mobile GSM) and then a faster network (DSL) letting it settle in between switching.

The key is to use a CircuitBuildTimeout that allows for circuits to build on the slower network.

I have a graph that illusrates Tor's 'tracking' ability.

The graph, Graphs - LAN to Dialup.jpg, reveals ping responses on the left.
In Blue is the ping time seen on the GSM (slower) network and in green the DSL (faster) network.

It is clearly higher for GSM.

On the right is the observed Network Reset Timeout as the networks are switched with each other. It does seem to cope. The secret is to make sure CircuitBuiltTimeout is larger than you largest expected latency in seconds.
Obviously if the network does not allow circuit creation within the CircuitBuildTimeout there will be no circuits available to move traffic.

Or... I have got it all wrong...

comment:16 in reply to:  15 ; Changed 7 years ago by ioneyes

Replying to ioneyes:

In summary. Tor appears to be behaving as expected. A low CircuitBuildTimeout (too low for the current network) will result in circuits timing out before Tor gets a chance to build them.

From testing it seems you have to allow at least 2 seconds of circuit build time for every second of ping response time on the current network. This seems to give Tor the time it needs to build circuits.

It also gives Tor enough latitude to increase its CircuitBuildTimeout in conjunction with the pareto distribution its modelled on, when a network connection speed reset (to a slower responding network) occurs.

Tor is capable of tracking quite large ( a ratio I have seen of up to 1:10 ) differences in response time from varying networks.

To test this I ran Tor and switched in a slower network (Mobile GSM) and then a faster network (DSL) letting it settle in between switching.

The key is to use a CircuitBuildTimeout that allows for circuits to build on the slower network.

I have a graph that illusrates Tor's 'tracking' ability.

The graph, Graphs - LAN to Dialup.jpg, reveals ping responses on the left.
In Blue is the ping time seen on the GSM (slower) network and in green the DSL (faster) network.

It is clearly higher for GSM.

On the right is the observed Network Reset Timeout as the networks are switched with each other. It does seem to cope. The secret is to make sure CircuitBuiltTimeout is larger than you largest expected latency in seconds.
Obviously if the network does not allow circuit creation within the CircuitBuildTimeout there will be no circuits available to move traffic.

Or... I have got it all wrong...

comment:17 in reply to:  16 Changed 7 years ago by ioneyes

Replying to ioneyes:

I made a slight inaccuracy. You need to allocate at least 2 seconds of build time, for each hop (normally 3=6 seconds) for every 1 second of ping response time. A rough guide.

Replying to ioneyes:

In summary. Tor appears to be behaving as expected. A low CircuitBuildTimeout (too low for the current network) will result in circuits timing out before Tor gets a chance to build them.

From testing it seems you have to allow at least 2 seconds of circuit build time for every second of ping response time on the current network. This seems to give Tor the time it needs to build circuits.

It also gives Tor enough latitude to increase its CircuitBuildTimeout in conjunction with the pareto distribution its modelled on, when a network connection speed reset (to a slower responding network) occurs.

comment:18 Changed 7 years ago by ioneyes

I have included a branch that modifies slightly the resetting of network connection speed.

Tor normally increases the CircuitbBuildTimeout by doubling the longest build time.
After some testing this seemed overly aggressive.

What is more important it seems is to set a CircuitBuildTime that allows for the natural ping time of the network. As mentioned above.

You can find my branch with the changes made here:
https://github.com/grubWare/Tor-0.2.3.14-alpha-custom-networkResetLowered.git

comment:19 Changed 7 years ago by mikeperry

I think you did the branch wrong. https://github.com/grubWare/Tor-0.2.3.14-alpha-custom-networkResetLowered/commits/master has only one commit, and it seems like compilation garbage.

I don't buy your natural ping time argument. That is what CBT is doing already if you look at the math, and it's also an orthogonal discussion to the current bug. The doubling issue is probably also a different bug.

Again, I see two solutions to the current ticket:

  1. We could just hack the timeout code not to start counting time until a TLS connection exists.
  2. we add in an "abandon the timeout data if all guards change" check to circuit_build_times_network_check_changed()

Do we want one or both? Or some unknown option 3?

comment:20 Changed 7 years ago by nickm

I think #1 is probably a fine option for Tor 0.2.3.x; do we have time to do it?

comment:21 Changed 7 years ago by arma

Milestone: Tor: 0.2.3.x-finalTor: 0.2.2.x-final
Summary: Client with low CBT can't switch to a bridge if it's slowClient with low CBT can't establish any circuits

Retargetting the bug as a client, rather than bridge, bug. It also affects actual 0.2.2.35 users in the wild.

comment:22 in reply to:  20 ; Changed 7 years ago by mikeperry

Replying to nickm:

I think #1 is probably a fine option for Tor 0.2.3.x; do we have time to do it?

Well the quick hack is just to move the call tor_gettimeofday(&circ->timestamp_created); from init_circuit_base() to the point in the connection state machine where the connection is complete. Not sure right now exactly where that is, but I imagine somewhere in connection_or.c? Possibly connection_tls_finish_handshake()? Or is there a later state we'd want to target instead? Of course, there's also the need to iterate over all pending circs for that orconn in that function..

Once that is done, we'd just add an extra check in circuit_expire_building() and maybe also circuit_expire_old_circuits_clientside() to ignore destroying circuits without an established ORCONN..

However, the question then arises: When do we actually decide to give up on an ORCONN connection attempt? Do we just let the OS decide? How many OSs will just keep waiting forever if they get neither SYNC/ACK or RST, or if the TLS server data never arrives? As a hack, we can just use the 95th percentile cbt->close_ms if we need any random value that is not forever.

comment:23 in reply to:  22 Changed 7 years ago by mikeperry

Replying to mikeperry:

Replying to nickm:

I think #1 is probably a fine option for Tor 0.2.3.x; do we have time to do it?

Well the quick hack is just to move the call tor_gettimeofday(&circ->timestamp_created); from init_circuit_base() to the point in the connection state machine where the connection is complete. Not sure right now exactly where that is, but I imagine somewhere in connection_or.c? Possibly connection_tls_finish_handshake()? Or is there a later state we'd want to target instead? Of course, there's also the need to iterate over all pending circs for that orconn in that function..

Clarification: You still need to leave the timestamp in init_circuit_base(), to handle the case where there already is an existing orconn ready to go.

Once that is done, we'd just add an extra check in circuit_expire_building() and maybe also circuit_expire_old_circuits_clientside() to ignore destroying circuits without an established ORCONN..

However, the question then arises: When do we actually decide to give up on an ORCONN connection attempt? Do we just let the OS decide? How many OSs will just keep waiting forever if they get neither SYNC/ACK or RST, or if the TLS server data never arrives? As a hack, we can just use the 95th percentile cbt->close_ms if we need any random value that is not forever.

comment:24 Changed 7 years ago by nickm

Milestone: Tor: 0.2.2.x-finalTor: 0.2.3.x-final

I'm not a fan of changing what "timestamp_created" means if there is anything else that uses it; I'd rather introduce a new timestamp with the right semantics and set/look at that instead.

ORConn timeouts are already implemented, I believe, right?

Do you have time to code this?

comment:25 in reply to:  24 ; Changed 7 years ago by mikeperry

Replying to nickm:

I'm not a fan of changing what "timestamp_created" means if there is anything else that uses it; I'd rather introduce a new timestamp with the right semantics and set/look at that instead.

I dunno.. That also seems odd to me. If you insist, we could add a 'timestamp_began' at the point/points where CREATE and CREATE_FAST get sent, but then I think we're going to find that we will have no instances of timestamp_created left in use in that case, as everything should then switch to using timestamp_began. I think circuit timeout and build time measurement should all behave uniformly, and not sometimes have TLS connection setup overhead factored in, and sometimes not.

How long it takes you to build an ORConn should be a separate problem from how long a circuit takes to get set up.

ORConn timeouts are already implemented, I believe, right?

No idea, man. Never even really looked at that code.

Do you have time to code this?

Probably not this month. :/

comment:26 in reply to:  25 Changed 7 years ago by nickm

Replying to mikeperry:

Replying to nickm:

I'm not a fan of changing what "timestamp_created" means if there is anything else that uses it; I'd rather introduce a new timestamp with the right semantics and set/look at that instead.

I dunno.. That also seems odd to me. If you insist, we could add a 'timestamp_began' at the point/points where CREATE and CREATE_FAST get sent, but then I think we're going to find that we will have no instances of timestamp_created left in use in that case, as everything should then switch to using timestamp_began. I think circuit timeout and build time measurement should all behave uniformly, and not sometimes have TLS connection setup overhead factored in, and sometimes not.

If that's the case, we should rename the field. "created" doesn't mean "was able to sent its first create cell".

Looks like the field is used in control.c, in rephist.c, and in circ*.c. A fair bit to audit; probably not totally impossible.

comment:27 Changed 7 years ago by nickm

Roger wants this fixed in 0.2.3.x, but I don't see how that's going to happen; it's pretty fragile code. If there is a simple, obvious fix we can apply it, but otherwise I don't see how we can do it safely. Maybe if there's a pretty-simple, nearly-obvious fix that we can allow to percolate in 0.2.4.x?

comment:28 in reply to:  14 Changed 7 years ago by arma

Replying to mikeperry:

Also looks like it was for a one hop dircon circuit, which I believe we recently decided to stop applying CBT timeouts to.

I believe this 'decided' was done transiently on irc, and the only record of it is this comment.

I've opened #6743 for what I hope is the least invasive version of this idea.

comment:29 Changed 7 years ago by mikeperry

Milestone: Tor: 0.2.3.x-finalTor: 0.2.4.x-final

I also agree the timestamp changes Nick and I discussed should at least bake for a while in the alpha series. I still think they're important for making CBT and circuit timeout in general behave more uniformly.

comment:30 Changed 7 years ago by nickm

Keywords: tor-client added

comment:31 Changed 7 years ago by nickm

Component: Tor ClientTor

comment:32 Changed 7 years ago by arma

Another hack here might be to not time-out the circuit if you don't have any circuits open right now.

(We just encountered this bug during our safer red-team analysis. Sad-face.)

comment:33 in reply to:  32 ; Changed 7 years ago by mikeperry

Replying to arma:

Another hack here might be to not time-out the circuit if you don't have any circuits open right now.

Sounds like a decent hack. We should kick out a notice (or warn) logline when we decide to ignore the timeout cutoff, though. Perhaps also with some stats about opened ORConns and maybe Guard/Bridge reachability?

(We just encountered this bug during our safer red-team analysis. Sad-face.)

Do you know if it was because they changed guards/bridges, because TLS connections were being throttled, or if it was some other mystery factor?

comment:34 in reply to:  33 ; Changed 7 years ago by arma

Replying to mikeperry:

(We just encountered this bug during our safer red-team analysis. Sad-face.)

Do you know if it was because they changed guards/bridges, because TLS connections were being throttled, or if it was some other mystery factor?

They ran their client using Tor-and-stegotorus-and-a-bridge, and then they switched to Tor-and-stegotorus-and-a-transparent-redirect-to-a-bridge.

They're on the local deter testbed, so their client became convinced that 1500ms was a good timeout.

comment:35 in reply to:  34 Changed 7 years ago by mikeperry

Replying to arma:

Replying to mikeperry:

(We just encountered this bug during our safer red-team analysis. Sad-face.)

Do you know if it was because they changed guards/bridges, because TLS connections were being throttled, or if it was some other mystery factor?

They ran their client using Tor-and-stegotorus-and-a-bridge, and then they switched to Tor-and-stegotorus-and-a-transparent-redirect-to-a-bridge.

They're on the local deter testbed, so their client became convinced that 1500ms was a good timeout.

Hrmm. I imagine the redirect didn't add measurable extra latency, nor did the bridge actually change..

Sounds like that leaves connection setup as the most likely culprit, still? In which case, this could still be fixed by the changes mentioned in https://trac.torproject.org/projects/tor/ticket/3443#comment:22 and followups.

I do like also deploying the hack you mention, too, just in case there are other issues. We definitely want that logline to be very descriptive of the tor client state, though.

comment:36 Changed 7 years ago by mikeperry

Actual Points: 3
Keywords: MikePerry201210 added
Status: newneeds_review

Ok, I have an untested fix in mikeperry/bug3443 that tries to do the "don't count the orconn setup phase of circuit build time" as well as the "If no opened circuits exist, be more patient with timeouts until one does".

Aside from being untested, this definitely needs review because I'm not sure I am checking the right state variables for "is the circuit really opened" and I'm also not sure I set the timestamp in the right place and checked the right circuit state properties there.

comment:37 Changed 7 years ago by nickm

I think this branch is actually called (typowise) "bug3433."

quick review, needs more review and testing. My comments might be off-base; I'm doing this quick so that you have _some_ feedback before the storm hits me. Sorry for what I miss and get wrong here.

  • Needs a changes file.
  • Comments shouldn't say stuff like "/* Hack for additional edge-cases of #3443 */". Instead, we explain what the edge cases are and what bug is handled if you don't handle them, or else our code becomes impossible to follow without reading along with the bugtracker.
  • Don't compare timevals with memcmp; it isn't safe. They can have uninitialized RAM that isn't covered by the fields. You really need to do timercmp instead.
  • It seems that if this "relaxed timeout" thing triggers, it's going to hit the logs a lot. Should that be a notice?
  • What's with the "XXX Need to update this..." ?

For your questions:

Aside from being untested, this definitely needs review because I'm not sure I am checking the right state variables for "is the circuit really opened" and I'm also not sure I set the timestamp in the right place and checked the right circuit state properties there.

I can try to review more for consistency, and Andrea can too, but it would help a lot if you can spell out which invariants and properties you want to make sure are maintained by this code.

comment:38 in reply to:  37 Changed 7 years ago by mikeperry

Replying to nickm:

I think this branch is actually called (typowise) "bug3433."

Doh. Pushed update to bug3443.

  • Don't compare timevals with memcmp; it isn't safe. They can have uninitialized RAM that isn't covered by the fields. You really need to do timercmp instead.

I've converted this into a direct check against the cpath state.

  • What's with the "XXX Need to update this..." ?

Technically we're updating the timestamp slightly early, but since we were already updating the created timestamp there, I decided to put both updates in the same place. It shouldn't matter anyway for this actually, since the orconn is already open and active. I've tried to clarify this comment.

For your questions:

Aside from being untested, this definitely needs review because I'm not sure I am checking the right state variables for "is the circuit really opened" and I'm also not sure I set the timestamp in the right place and checked the right circuit state properties there.

I can try to review more for consistency, and Andrea can too, but it would help a lot if you can spell out which invariants and properties you want to make sure are maintained by this code.

My main concerns are with my usage of cpath->state and has_opened (which I use to check for pending first hop/orconn and opened circuits, respectively). I've also tried to improve the comments there.

Everything else you mentioned should be fixed in mikeperry/bug3443. I'll be testing this in combination with #7157 soon.

comment:39 Changed 7 years ago by mikeperry

Actual Points: 35

comment:40 Changed 7 years ago by mikeperry

I've been running a test branch with this for a couple days now. The core functionality of CBT seems the same, which is good. We're still allowing pretty much exactly 80.0% of the circuits to complete with this change.

However, arma's hack ("No circuits are opened. Relaxing timeout") has been triggered twice over the course of around 5000 circuits. In each case, the victim->n_chan (ORconn) was in state "opened", and I had not changed anything about my network as far as I know.. I guess it is a good idea to keep it at info.

If we're OK with this rate of false hack application, I think this branch is ready for another review and potentially a merge.

comment:41 Changed 7 years ago by mikeperry

Reminder: I fixed the branch typo from last time. The branch to review really is mikeperry/bug3443 this time.

comment:42 Changed 7 years ago by mikeperry

Hrmm.. On further consideration, I think I want to limit arma's hack some more. It's actually going to be a quite common situation that a user will walk away from their computer, Tor will decide to stop building circuits, and then when they come back, it's going to use a slower circuit than it should with probability of something like (0.20/available_circuits).

We should avoid actually using a circuit that has its relaxed_timeout bit set if other circuits are available... I wonder if there are other ways to avoid this situation, too?

comment:43 Changed 7 years ago by mikeperry

Keywords: MikePerry201211 added; MikePerry201210 removed
Status: needs_reviewneeds_revision

I'll add in this relaxed_timeout check somewhere in circuituse.c before this needs to be reviewed again.

comment:44 in reply to:  43 Changed 7 years ago by mikeperry

Actual Points: 56
Status: needs_revisionneeds_review

Replying to mikeperry:

I'll add in this relaxed_timeout check somewhere in circuituse.c before this needs to be reviewed again.

Ok, this is done. In mikeperry/bug3443, I added a check to circuit_is_better() to prefer circuits without relaxed_timeout set. I assume that's the only place such a check is needed?

I also pushed a commit to ensure we actually count such "relaxed" timeouts as CBT timeouts, so that CBT will reset the timeout value to an unlearned state if we end up having to relax and/or timeout a ton of circuits in a row.

comment:45 Changed 7 years ago by nickm

Dumb question:

  • I can't see what actually relaxes the timeout when relaxed_timeout is set. It appears that the first time through the circuit_expire_building loop, we set relaxed_timeout , but the next time through the loop, we'll see that relaxed_timeout is already set, and call the "No circuits are opened" log message", and go on. What's supposed to be happening here? I thought the logic was that a relaxed_timeout circuit should get a more lax timeout, but it seems like everybody's getting the same close_cutoff.

Should get fixed, not necessarily before merge:

  • The new while loop in circuit_expire_building uses a silly form. It's not necessary to have all this "victim" business when you're not modifying the list of circuits-- the only reason to use that pattern is if you think you might be removing "victim" from the list.
  • In the later while loop in circuit_expire_building, you're checking whether TO_ORIGIN_CIRCUIT(victim)->cpath is set, but earlier in the function, you're doing "if (TO_ORIGIN_CIRCUIT(victim)->cpath->state == CPATH_STATE_CLOSED) {"
  • Describe_status_for_controller probably wants to take the created time, so as not to change its behavior.
  • The documentation for timestamp_began and timestamp_created should explain how they're different.

Other than that, if it's testing ok, I say it's mergeable.

comment:46 in reply to:  45 ; Changed 7 years ago by mikeperry

Replying to nickm:

Dumb question:

  • I can't see what actually relaxes the timeout when relaxed_timeout is set. It appears that the first time through the circuit_expire_building loop, we set relaxed_timeout , but the next time through the loop, we'll see that relaxed_timeout is already set, and call the "No circuits are opened" log message", and go on. What's supposed to be happening here? I thought the logic was that a relaxed_timeout circuit should get a more lax timeout, but it seems like everybody's getting the same close_cutoff.

The logic is "If there are no opened circuits, allow *every* circuit to wait until the close_cutoff, but mark each one that we decide to do this for, so we can both avoid using them later unless we have to, and also do some CBT bookkeeping. As soon as there are *any* opened circuits, we can stop doing this relaxed timeout stuff and even let all the unbuilt relaxed circuits we were waiting on expire immediately."

However, I've decided to change the "any_opened_circuits" check to be "Are there any opened 3 hop circuits that we're not trying to cannibalize?" Otherwise, I think we're going to count open 1-hop and other weird circuits to mean we shouldn't relax, which is not quite right.

I've also upgraded the log message where a relaxed circuit *still* isn't opened by the close_cutoff to a notice. If we get all the way to that point with no circuits opened, it might mean more trouble/bugs.

Should get fixed, not necessarily before merge:

  • The new while loop in circuit_expire_building uses a silly form. It's not necessary to have all this "victim" business when you're not modifying the list of circuits-- the only reason to use that pattern is if you think you might be removing "victim" from the list.
  • In the later while loop in circuit_expire_building, you're checking whether TO_ORIGIN_CIRCUIT(victim)->cpath is set, but earlier in the function, you're doing "if (TO_ORIGIN_CIRCUIT(victim)->cpath->state == CPATH_STATE_CLOSED) {"
  • Describe_status_for_controller probably wants to take the created time, so as not to change its behavior.
  • The documentation for timestamp_began and timestamp_created should explain how they're different.

All of these should be fixed in the latest commit to mikeperry/bug3443.

The doc update also made me realize we should kill the timestamp_created update during cannibalize. We should only update timestamp_began there.

Other than that, if it's testing ok, I say it's mergeable.

Should I squash it down? Should this all be just one commit?

As a heads up, I also plan to rebase my other two branches (#7157 and #7341) to be on top of this one, to make testing all of them easier (since all three can be exercised with the same test).

comment:47 Changed 7 years ago by mikeperry

Actual Points: 68

comment:48 in reply to:  46 Changed 7 years ago by nickm

Replying to mikeperry:

Replying to nickm:

Dumb question:

  • I can't see what actually relaxes the timeout when relaxed_timeout is set. It appears that the first time through the circuit_expire_building loop, we set relaxed_timeout , but the next time through the loop, we'll see that relaxed_timeout is already set, and call the "No circuits are opened" log message", and go on. What's supposed to be happening here? I thought the logic was that a relaxed_timeout circuit should get a more lax timeout, but it seems like everybody's getting the same close_cutoff.

The logic is "If there are no opened circuits, allow *every* circuit to wait until the close_cutoff, but mark each one that we decide to do this for, so we can both avoid using them later unless we have to, and also do some CBT bookkeeping. As soon as there are *any* opened circuits, we can stop doing this relaxed timeout stuff and even let all the unbuilt relaxed circuits we were waiting on expire immediately."

I'm really confused here; I don't see how the code is doing that. In particular, I'm looking at the part that starts "if (!any_opened_circuits)." I'll read it again until it makes sense once I'm back online later.

However, I've decided to change the "any_opened_circuits" check to be "Are there any opened 3 hop circuits that we're not trying to cannibalize?" Otherwise, I think we're going to count open 1-hop and other weird circuits to mean we shouldn't relax, which is not quite right.

seems sensible.

Other than that, if it's testing ok, I say it's mergeable.

Should I squash it down? Should this all be just one commit?

No; I'll do squashing as needed before I merge.

As a heads up, I also plan to rebase my other two branches (#7157 and #7341) to be on top of this one, to make testing all of them easier (since all three can be exercised with the same test).

Okay, but if this one gets squashed before merge, you'll need to rebase them back onto master.

comment:49 Changed 7 years ago by mikeperry

Actual Points: 89

Hrmm. Perhaps we can talk about this more on IRC or something? The key to the !any_opened_circuits is the continue vs the fallthough... The continue saves the circ, but fallthough times it out and kills it. Remember, the relaxed_timeout flag is informational. It doesn't govern timeout selection.

comment:50 Changed 7 years ago by nickm

Okay, I think I get it now. Wow, that's a complicated function. (A look over at https://trac.torproject.org/projects/tor/attachment/ticket/6313/top_offenders.txt suggests that it could be the 11th-nastiest function in all of Tor.)

I now think it's probably right. I'll squash and merge it once I've slept and had coffee.

comment:51 in reply to:  50 Changed 7 years ago by nickm

Replying to nickm:

Okay, I think I get it now. Wow, that's a complicated function. (A look over at https://trac.torproject.org/projects/tor/attachment/ticket/6313/top_offenders.txt suggests that it could be the 11th-nastiest function in all of Tor.)

(To be clear, I'm not calling you out: it is a tricky function as-is in master right now, without this patch.)

comment:52 Changed 7 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Okay, merged it. Thanks!

Note: See TracTickets for help on using tickets.