Opened 9 years ago

Closed 7 years ago

Last modified 7 years ago

#1297 closed defect (implemented)

Update hidden service logic to be more resilient to timeout

Reported by: mikeperry
Owned by: rransom
Priority: High
Milestone: Tor: 0.2.3.x-final
Component: Core Tor/Tor
Version: 0.2.2.10-alpha
Severity:
Keywords: tor-hs
Cc: mikeperry, Sebastian, nickm
Actual Points:
Parent ID: #2552
Points:
Reviewer:
Sponsor:

Description (last modified by Sebastian)

Now that we expire 20% of our slowest circuits, there is a chance that clients may pick a rend point that
hidden services are unable to reach in 3 tries within their circuit build timeout value. This will cause the
client connection to fail.

We should look at this code and see if we can make it more resilient to timeouts, or have it back off on the
timeout value after N tries instead of giving up entirely on the connection.
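
One way to picture the back-off idea is the sketch below. It is not Tor code: `rend_retry_t`, `rend_retry_next_timeout`, the doubling factor, and the cap are all hypothetical, chosen only to illustrate lengthening the timeout after N failed tries instead of abandoning the connection.

```c
#include <assert.h>

/* Hypothetical retry state for one rendezvous attempt; not a Tor struct. */
typedef struct {
  int base_timeout_sec;     /* the normal circuit build timeout */
  int tries_before_backoff; /* N: number of tries made at the base timeout */
  int max_timeout_sec;      /* upper bound on the backed-off timeout */
  int tries_so_far;         /* tries already handed out */
} rend_retry_t;

/* Return the timeout to use for the next try and count that try.
 * The first N tries use the base timeout; each later try doubles the
 * previous value, capped at max_timeout_sec. */
static int
rend_retry_next_timeout(rend_retry_t *r)
{
  assert(r->base_timeout_sec > 0);
  assert(r->max_timeout_sec >= r->base_timeout_sec);

  int timeout = r->base_timeout_sec;
  int doublings = r->tries_so_far - r->tries_before_backoff + 1;

  /* Stop doubling as soon as the cap is reached, which also avoids
   * overflow for large try counts. */
  for (int i = 0; i < doublings && timeout < r->max_timeout_sec; ++i)
    timeout *= 2;
  if (timeout > r->max_timeout_sec)
    timeout = r->max_timeout_sec;

  r->tries_so_far++;
  return timeout;
}
```

With a base timeout of 10 seconds, N = 3, and a cap of 60 seconds, successive calls would yield 10, 10, 10, 20, 40, 60, 60, and so on.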

[Automatically added by flyspray2trac: Operating System: All]

Attachments (1)

bug-1297b-notes-2011-12-24-01.txt (2.6 KB) - added by rransom 7 years ago.

Change History (22)

comment:1 Changed 9 years ago by Sebastian

Description: modified (diff)

Is there any progress on this, or are there new ideas? I believe this might be a reason for many of the hidden service connectivity problems we've been hearing about from alpha users lately. Something I've experienced myself was that occasionally a hidden service would take 120 seconds to time out, but a new request would very quickly succeed.

comment:2 Changed 9 years ago by nickm

Milestone: Tor: 0.2.3.x-final

Moving this to 0.2.3.x; it can be a fix on 0.2.2.x if there's a bug with a non-scary fix, though.

comment:3 Changed 8 years ago by arma

Parent ID: #2552

comment:4 Changed 8 years ago by arma

Component: changed from Tor Client to Tor hidden services

comment:5 Changed 8 years ago by rransom

Owner: changed from mikeperry to rransom
Priority: changed from minor to normal
Status: changed from new to assigned

I've seen a hidden service client time out on a rendezvous circuit, then try again with a new rendezvous circuit and introduction point, much faster than I think it should have.

comment:6 Changed 8 years ago by rransom

Milestone: changed from Tor: 0.2.3.x-final to Tor: 0.2.2.x-final
Priority: changed from normal to major

circuit_expire_building is impressively broken. The fix(es) will be non-scary, and should definitely be merged to maint-0.2.2.

comment:7 in reply to: 1 Changed 8 years ago by rransom

Status: changed from assigned to needs_review

Replying to Sebastian:

Something I've experienced myself was that occasionally a hidden service would take 120 seconds to time out, but a new request would very quickly succeed.

This problem is on the client side, not the server side (which this ticket's description focuses on). The client spends its pre-built general-purpose circuits somehow: possibly on the descriptor fetch, possibly on introduction or rendezvous circuits which immediately time out (I haven't dug thoroughly enough into the source to find out whether this happens yet). Then all of the rendezvous circuits and introduction circuits it opens time out. When the user opens a second AP connection after the first times out, the client has some pre-built circuits ready, and the introduction and rendezvous attempts succeed before the CBT code reaps those circuits.

See bug1297a ( git://git.torproject.org/rransom/tor.git bug1297a ) for fixes for some timeout-induced breakage on the client side. I suspect that this doesn't completely fix #1297 on the client side, and it doesn't even touch the hidden service side.

comment:8 Changed 8 years ago by nickm

Status: changed from needs_review to assigned

Looks fine to me. Squashing and merging. Throwing this back out of needs_review, since iiuc you say there are more cases left.

comment:9 in reply to:  7 Changed 8 years ago by rransom

Replying to rransom:

Replying to Sebastian:

Something I've experienced myself was that occasionally a hidden service would take 120 seconds to time out, but a new request would very quickly succeed.

This problem is on the client side, not the server side (which this ticket's description focuses on). The client spends its pre-built general-purpose circuits somehow: possibly on the descriptor fetch, possibly on introduction or rendezvous circuits which immediately time out (I haven't dug thoroughly enough into the source to find out whether this happens yet).

From circuit_launch_by_extend_info, if circ is being cannibalized:

      /* reset the birth date of this circ, else expire_building
       * will see it and think it's been trying to build since it
       * began. */
      tor_gettimeofday(&circ->_base.timestamp_created);

So intro and rend circuits do not die immediately after they are obtained through cannibalism.
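
The effect of that timestamp reset can be illustrated with a simplified sketch. Nothing here is Tor's actual code: `sketch_circ_t`, `sketch_build_has_expired`, and `sketch_cannibalize` are hypothetical stand-ins, and the real expiry logic in circuit_expire_building considers much more than a single timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Simplified stand-in for a circuit; the real Tor structure differs. */
typedef struct {
  time_t timestamp_created; /* when the current build attempt began */
} sketch_circ_t;

/* True if the circuit has been building for longer than timeout_sec. */
static bool
sketch_build_has_expired(const sketch_circ_t *circ, time_t now,
                         int timeout_sec)
{
  assert(timeout_sec >= 0);
  return difftime(now, circ->timestamp_created) > (double)timeout_sec;
}

/* Reuse an existing circuit for a new purpose: restart its build clock so
 * the expiry check measures only the extension, not the original build. */
static void
sketch_cannibalize(sketch_circ_t *circ, time_t now)
{
  circ->timestamp_created = now;
}
```

Without the reset, a circuit built 100 seconds ago would look expired under a 60-second timeout the moment it was cannibalized; with the reset, it gets a fresh 60 seconds.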

comment:10 Changed 8 years ago by Sebastian

What is left to do here?

comment:11 in reply to:  10 Changed 8 years ago by rransom

Replying to Sebastian:

What is left to do here?

In order to give hidden services with high circuit-build timeouts a chance of working, we need to modify the client code so that when a client's intro circ times out in state C_INTRODUCE_ACK_WAIT, the client leaves its corresponding rendezvous circuit open while it tries again with a different intro/rend circuit pair. This will require creating another state for rendezvous circuits (stored in the ‘purpose’ field).
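
As a rough illustration of what an extra rendezvous-circuit state could look like, here is a minimal sketch. The enum values and function names are hypothetical and are not Tor's CIRCUIT_PURPOSE_* constants; the sketch only shows a rendezvous circuit being kept open, but excluded from new attempts, after its paired intro circuit times out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical purposes for a client rendezvous circuit. */
typedef enum {
  SKETCH_REND_READY,           /* rendezvous established, awaiting service */
  SKETCH_REND_INTRO_TIMED_OUT, /* paired intro circ timed out; kept open */
  SKETCH_REND_JOINED           /* service connected; circuit in use */
} sketch_rend_purpose_t;

typedef struct {
  sketch_rend_purpose_t purpose;
} sketch_rend_circ_t;

/* The paired intro circuit timed out while waiting for an ack: keep the
 * rendezvous circuit open, but move it to the extra state. */
static void
sketch_on_intro_timeout(sketch_rend_circ_t *rend)
{
  assert(rend);
  if (rend->purpose == SKETCH_REND_READY)
    rend->purpose = SKETCH_REND_INTRO_TIMED_OUT;
}

/* Only a fresh READY circuit may be paired with a new intro attempt, so
 * the retry has to use a different intro/rend pair. */
static bool
sketch_usable_for_new_attempt(const sketch_rend_circ_t *rend)
{
  assert(rend);
  return rend->purpose == SKETCH_REND_READY;
}

/* The service may still show up late on a kept-open circuit. Returns true
 * if the circuit was waiting and is now joined. */
static bool
sketch_on_service_joined(sketch_rend_circ_t *rend)
{
  assert(rend);
  if (rend->purpose == SKETCH_REND_READY ||
      rend->purpose == SKETCH_REND_INTRO_TIMED_OUT) {
    rend->purpose = SKETCH_REND_JOINED;
    return true;
  }
  return false;
}
```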

comment:12 Changed 7 years ago by nickm

Milestone: changed from Tor: 0.2.2.x-final to Tor: 0.2.3.x-final

At this point, I'm having a hard time seeing this as having a good risk/reward ratio for 0.2.2.x. Once there's code, you can try to convince me otherwise if you want.

comment:13 Changed 7 years ago by rransom

Status: changed from assigned to needs_review

See bug1297b ( https://git.torproject.org/rransom/tor.git bug1297b ) for a not-yet-tested branch on 0.2.3.x to make clients keep HS circuits which have reached their normal CBT around longer while retrying with new intro/rend circuits.

I will need to add a configuration option to allow users to disable this new behaviour, because even though it will clearly improve HS connection-establishment performance (assuming it works correctly), I suspect that it will harm performance after the connection is established, because we will now use circuits which took longer to build. We currently do not have tools designed to test latency on already-opened circuits; when we do, we will want to investigate this further.

There is one remaining change to make for this ticket, on the service side: hidden services should be able to keep their CIRCUIT_PURPOSE_S_CONNECT_REND circuits open after they time out, while building another rendezvous circuit in parallel.

comment:14 Changed 7 years ago by rransom

The commits on bug1297b (up to c04093363803a4120bdecae82d61e357e869d1fe) do not break Tor when used in TBB. I've pushed some more commits, including a few to fix unrelated bugs; the new changes are not yet tested, and will have to be squashed and rearranged a bit.

comment:15 Changed 7 years ago by rransom

See bug1297b-v2 ( https://git.torproject.org/rransom/tor.git bug1297b-v2 ) for the rebased branch. This branch contains my bug4759-v2 branch, because these changes require that #4759 be fixed.

comment:16 Changed 7 years ago by nickm

Looks good, I think. Could I have some comments explaining what can happen to a circuit once hs_circ_has_timed_out is set on it? The current comments do a good job of explaining when the flag is set, but not the allowable transitions out of that state. (So, the idea is that a "timed out" circuit is not really timed out, but allowed to stick around a little longer in case it works, in which case we declare it to be okay?)

Why would you want to set CloseHSClientCircuitsImmediatelyOnTimeout ? Is it just there for testing, or what?

Is there any limit on how many times this code can relaunch circuits on timeout for the same request?

comment:17 in reply to: 16 Changed 7 years ago by rransom

Replying to nickm:

Looks good, I think. Could I have some comments explaining what can happen to a circuit once hs_circ_has_timed_out is set on it? The current comments do a good job of explaining when the flag is set, but not the allowable transitions out of that state.

OK. I'll push a comment change tomorrow.

(So, the idea is that a "timed out" circuit is not really timed out, but allowed to stick around a little longer in case it works, in which case we declare it to be okay?)

Yes, that's the idea.

Why would you want to set CloseHSClientCircuitsImmediatelyOnTimeout ? Is it just there for testing, or what?

The justification for Tor's adaptive-CBT code is that circuits which are built more quickly are also ‘faster’ after they are built. These changes will cause clients to use circuits with longer build times, in order to decrease the overall time until some circuit is connected to a hidden service. Users who connect to or host latency-sensitive hidden services (e.g. IRC) might want to set the options which disable these changes.

We will also want to use those options to test the impact of this change on performance, someday when we have a performance-measurement tool which measures the latency on an open circuit (rather than only measuring the time until a first request has completed through a Tor client with no circuits open).

Is there any limit on how many times this code can relaunch circuits on timeout for the same request?

On the client side, HS circuits are relaunched by the existing code in circuit_get_open_circ_or_launch when it does not find an ‘acceptable’ circuit to use (as defined by circuit_is_acceptable, which never considers a circuit with hs_circ_has_timed_out set acceptable). The client will continue launching circuits as long as there is an AP connection trying to connect to the hidden service and there is an intro point remaining to try for the HS. (Before the last #3825 change, clients could keep pounding an intro point for SocksTimeout seconds; now, the maximum number of intro circs is five circs per intro point.)
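
The client-side behaviour described above can be approximated by the sketch below. It is a simplification under stated assumptions: `sketch_hs_circ_t`, its `is_open` field, and all `sketch_*` functions are hypothetical, and the real circuit_is_acceptable checks many more conditions. Only the flag name hs_circ_has_timed_out and the limit of five circuits per intro point come from this ticket.

```c
#include <assert.h>
#include <stdbool.h>

/* "Five circs per intro point", as stated in the comment above. */
#define SKETCH_MAX_INTRO_CIRCS_PER_POINT 5

/* Simplified stand-in for a hidden-service circuit. */
typedef struct {
  bool is_open;               /* hypothetical: circuit finished building */
  bool hs_circ_has_timed_out; /* flag name taken from the ticket */
} sketch_hs_circ_t;

/* A circuit flagged as timed out is never acceptable for a new request,
 * even though it is left open in case it completes later. */
static bool
sketch_is_acceptable(const sketch_hs_circ_t *c)
{
  assert(c);
  return c->is_open && !c->hs_circ_has_timed_out;
}

/* Return the index of the first acceptable circuit, or -1 to signal that
 * the caller should launch a new one. */
static int
sketch_find_acceptable(const sketch_hs_circ_t *circs, int n)
{
  for (int i = 0; i < n; ++i) {
    if (sketch_is_acceptable(&circs[i]))
      return i;
  }
  return -1;
}

/* Relaunching toward one intro point stops at the per-point limit. */
static bool
sketch_may_launch_intro(int circs_launched_to_point)
{
  return circs_launched_to_point < SKETCH_MAX_INTRO_CIRCS_PER_POINT;
}
```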

On the service side, rendezvous circuits are relaunched when they reach the normal timeout for four-hop circuits. The service will stop launching circuits to a client's rendezvous point after launching MAX_REND_FAILURES circuits (currently 30) or after trying to connect for MAX_REND_TIMEOUT seconds (currently 30). MAX_REND_FAILURES is too high, but I don't know what number to lower it to yet.
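
The two service-side limits can be sketched as a single check. The constant values are the ones quoted above; the `SKETCH_` prefix, the struct, the function, and the choice of `>=` at each boundary are assumptions made for illustration, not the service's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Values as quoted in the comment above; names prefixed to mark them as
 * part of this sketch rather than Tor's own definitions. */
#define SKETCH_MAX_REND_FAILURES 30
#define SKETCH_MAX_REND_TIMEOUT 30 /* seconds */

/* Hypothetical bookkeeping for one client's rendezvous request. */
typedef struct {
  int circuits_launched; /* rendezvous circuits launched so far */
  time_t first_attempt;  /* when the service started trying */
} sketch_rend_attempt_t;

/* True if the service may launch another circuit to the client's
 * rendezvous point: neither the circuit count nor the elapsed time has
 * reached its limit. */
static bool
sketch_service_may_relaunch(const sketch_rend_attempt_t *a, time_t now)
{
  assert(a);
  if (a->circuits_launched >= SKETCH_MAX_REND_FAILURES)
    return false;
  if (difftime(now, a->first_attempt) >= (double)SKETCH_MAX_REND_TIMEOUT)
    return false;
  return true;
}
```

Under these numbers the time limit will usually be hit long before the circuit limit, which is consistent with the remark that MAX_REND_FAILURES is too high.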

This change does not increase the number of circuits built for a hidden-service connection attempt; it is likely to decrease the number of circuits, by decreasing the time before a client successfully connects to a hidden service (and thus decreasing the time for which it builds new circuits for the HS connection attempt).

comment:18 in reply to:  17 Changed 7 years ago by rransom

Replying to rransom:

Replying to nickm:

Looks good, I think. Could I have some comments explaining what can happen to a circuit once hs_circ_has_timed_out is set on it? The current comments do a good job of explaining when the flag is set, but not the allowable transitions out of that state.

OK. I'll push a comment change tomorrow.

Pushed.

Changed 7 years ago by rransom

comment:19 Changed 7 years ago by nickm

Resolution: changed from None to implemented
Status: changed from needs_review to closed

Merging and closing. Thanks!

comment:20 Changed 7 years ago by nickm

Keywords: tor-hs added

comment:21 Changed 7 years ago by nickm

Component: changed from Tor Hidden Services to Tor