Opened 3 months ago

Last modified 3 weeks ago

#23681 merge_ready defect

prop224: Clients mark intro circs as timed-out within seconds

Reported by: asn Owned by: dgoulet
Priority: High Milestone: Tor: 0.3.1.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: prop224, tor-hs, 031-backport, 030-backport
Cc: mikeperry Actual Points:
Parent ID: Points:
Reviewer: mikeperry Sponsor:

Description

I noticed that my prop224 client had some reconnects and I did some log digging. The original disconnect was caused by a truncate cell (probably natural causes) but then a whole dance of introduction/rendezvous started which ended up marking 3 intro circs as timed out within seconds:

<FIRST TIMED-OUT INTRO CIRC>
Sep 28 03:55:26.000 [notice] Introduction circuit 2276964442 has opened. Attaching streams.
Sep 28 03:55:26.000 [info] connection_ap_handshake_attach_circuit(): Intro (2276964442) and rend (2943321454) circs are not both ready. Stalling conn. (2 sec old)
Sep 28 03:55:29.000 [info] circuit_expire_building(): Marking circ 2276964442 (state 4:open, purpose 6) as timed-out HS circ
<SECOND TIMED-OUT INTRO CIRC>
Sep 28 03:55:29.000 [info] circuit_get_best(): There is an intro circuit being created right now, but it has already taken quite a while. Starting one in parallel.
Sep 28 03:55:29.000 [notice] Introduction circuit 4126528467 has opened. Attaching streams.
Sep 28 03:55:29.000 [info] connection_ap_handshake_attach_circuit(): Intro (4126528467) and rend (2943321454) circs are not both ready. Stalling conn. (5 sec old)
Sep 28 03:55:34.000 [info] circuit_expire_building(): Marking circ 4126528467 (state 4:open, purpose 6) as timed-out HS circ
<THIRD TIMED-OUT INTRO CIRC>
Sep 28 03:55:29.000 [info] circuit_get_best(): There is an intro circuit being created right now, but it has already taken quite a while. Starting one in parallel.
Sep 28 03:55:34.000 [notice] Introduction circuit 4282777781 has opened. Attaching streams.
Sep 28 03:55:34.000 [info] connection_ap_handshake_attach_circuit(): Intro (4282777781) and rend (2943321454) circs are not both ready. Stalling conn. (10 sec old)
Sep 28 03:55:39.000 [info] circuit_expire_building(): Marking circ 4282777781 (state 4:open, purpose 6) as timed-out HS circ
<CIRCUITS CLOSE>
Sep 28 03:57:26.000 [info] circuit_expire_building(): Abandoning circ 232 137.205.124.35:1720:2276964442 (state 1,4:open, purpose 6, len 4)
Sep 28 03:57:26.000 [info] circuit_mark_for_close_(): Circuit 2276964442 (id: 232) marked for close at src/or/circuituse.c:820 (orig reason: 10, new reason: 0)
Sep 28 03:57:30.000 [info] circuit_expire_building(): Abandoning circ 231 137.205.124.35:1720:4126528467 (state 1,4:open, purpose 6, len 4)
Sep 28 03:57:30.000 [info] circuit_mark_for_close_(): Circuit 4126528467 (id: 231) marked for close at src/or/circuituse.c:820 (orig reason: 10, new reason: 0)
Sep 28 03:57:35.000 [info] circuit_expire_building(): Abandoning circ 233 137.205.124.35:1720:4282777781 (state 1,4:open, purpose 6, len 4)
Sep 28 03:57:35.000 [info] circuit_mark_for_close_(): Circuit 4282777781 (id: 233) marked for close at src/or/circuituse.c:820 (orig reason: 10, new reason: 0)

You can see that we went ahead and marked intro circs as timed-out within 3 seconds of launching them which makes no sense, and caused a whole lot of mess in the prop224 state machine.

We should try to figure out if there is a bug here.

Child Tickets

Attachments (1)

weird_ips.log.gz (30.3 KB) - added by asn 3 months ago.

Download all attachments as: .zip

Change History (18)

comment:1 Changed 3 months ago by dgoulet

Yah, the expire time of building HS circuits are different timings from the rest.

  /* Intro circs have an extra round trip (and are also 4 hops long) */
  SET_CUTOFF(c_intro_cutoff, get_circuit_build_timeout_ms() * (14/6.0) + 1000);

Basically, the build timeout of what I've observed on a "normal service connection" is usually around 2s. This line in your logs will tell you the latest value:

[info] circuit_build_times_set_timeout(): Set circuit build timeout to 2s (1500.000000ms, 60000.000000ms, Xm: 404, a: 2.882408, r: 0.031000) based on 1000 circuit times

Thus taking that into consideration, the cutoff should be 5666msec or ~5.6sec but in your case it seems the build circuit timeout was way lower that is around 1sec or even below... Those were introduced in commit 3458d904f62b2d97 so Mike is probably the person to talk to about this. There is this #3733 ticket that touches this expire building function that I need to sit down with Mike and go over all those timeouts and try to make sure they work well, the conditions in that function are good for both v2 and v3.

If you client becomes slower and slower at building circuit, the timeout should change gradually but if the first 1000 circuits were super fast and then you move to the airport shitty Wifi for instance, ramping up that timeout value will take your client a lot of failures...

But all in all, this shouldn't break the state of the client? It seems your client never got the ACK from the introduction point so it is just looping on trying to get it with new circuits... :S

Changed 3 months ago by asn

Attachment: weird_ips.log.gz added

comment:2 Changed 3 months ago by asn

Attaching info logs. See timestamps above and grep for [notice] to detect the right place!

And yes the cbt is at 2s...

Last edited 3 months ago by asn (previous) (diff)

comment:3 Changed 3 months ago by dgoulet

Oh ok... so the the RP circuit got stuck for a while waiting for the RENDEZVOUS_ESTABLISHED which stalls the client to send the INTRO1 cell and it took so long that many intro point were opened and closed while waiting on that RP...

See the RP circuit 2943321454 ... I can't tell when that was opened nor if a establish rdv has been sent!...

comment:4 Changed 2 months ago by asn

Status: newneeds_information

comment:5 Changed 7 weeks ago by dgoulet

Owner: set to dgoulet
Status: needs_informationaccepted

comment:6 Changed 7 weeks ago by dgoulet

Cc: mikeperry added
Status: acceptedneeds_information

(CCing Mike here to have an opinion from him)

In a nutshell, the client is waiting on the RP circuit to established and without that, the INTRO1 cell is not sent. In the meantime, the client also launches an intro circuit (for performance) so when we get the RENDEZVOUS_ESTABLISHED cell, we immediately send the INTRO1 cell on the open circuit.

However, imagine the RP circuit takes a while to build (or maybe we fail to handle a state #3733 ?), the client will expire the intro circuit and rebuild one while waiting for the RP. And seems that this dance went for a while in asn's case.

I really wonder how to fix this. What would be best, just make sure we expire the RP circuit in a reasonable amount of time or make an intro circuit expire only if no RP circuit can be found? (All this client side ofc).

comment:7 Changed 4 weeks ago by dgoulet

Status: needs_informationneeds_revision

After a discussion on IRC with mikeperry, we'll go with making circuit_expire_building() to not consider C_INTRODUCING with the build timeout because at that point the circuit is OPENED but the cell hasn't been sent yet because we could be waiting for the RP circuit to complete. Thus, we should not even try to consider it in that function and let the idle circuit timeout kick in if it takes too long.

comment:8 Changed 4 weeks ago by dgoulet

Priority: MediumHigh
Reviewer: mikeperry
Status: needs_revisionneeds_review

Branch: bug23681_032_01

Putting Mike has a reviewer since we discussed this and it can be very tricky.

comment:9 Changed 4 weeks ago by mikeperry

Status: needs_reviewneeds_revision

Code-wise this looks good. These intro circs still should be given longer than normal since they have 4 hops, but there is a case for that.

The changes file has a little wonky phrasing though. Instead of "The introduction circuit was timeout very fast while waiting" I would say "The introduction circuit was being timed out too quickly while waiting".

Other than that, this all looks good. Did you verify the idle timeout applies to intro circs? Maybe try to connect to a hidden service that has rend circuit building commented out so it rends don't complete?

comment:10 Changed 4 weeks ago by dgoulet

Status: needs_revisionneeds_review

The SocksTimeout of 120 seconds kicked in much faster than the idle circuit timeout:

[...]
Nov 16 09:17:03.886 [info] connection_ap_handshake_attach_circuit(): Intro 2980528860 (id: 6) and rend circuit 0 (id: 0) circuits are not both ready. Stalling conn. (119 sec old)
Nov 16 09:17:04.886 [notice] Tried for 120 seconds to get a connection to [scrubbed]. Giving up. (waiting for circuit)

Intro circuit (id = 6), choose this quite huge timeout of 52 minutes:

Nov 16 09:14:21.888 [info] origin_circuit_new(): Circuit 6 chose an idle timeout of 3140 based on 3026 seconds of predictive building remaining.

Seems "normal" but also quite long.

Anyway, here is the updated branch: bug23681_032_01

comment:11 Changed 4 weeks ago by mikeperry

The socks timeout is not going to close the circuit. Did it eventually close? If you have a test case, you can lower that idle timeout by setting CircuitsAvailableTimeout to like 3 (3 minutes - long enough for the socks timeout).

If the circuit does close, then this is merge ready IMO.

comment:12 Changed 4 weeks ago by mikeperry

Ugh, the other wrinkle to check is if the hidden service comes back online after the socks timeout, but while that old intro circuit is still laying around. Does Tor try to reuse the intro in that case? Does that work, or does it break?

Sorry this is so annoying to verify.

comment:13 in reply to:  12 Changed 4 weeks ago by dgoulet

Status: needs_reviewneeds_information

Replying to mikeperry:

Ugh, the other wrinkle to check is if the hidden service comes back online after the socks timeout, but while that old intro circuit is still laying around. Does Tor try to reuse the intro in that case? Does that work, or does it break?

First, no the SocksTimeout does *not* make the intro point circuit to close.

I haven't tested it but I'm pretty sure the client won't find that circuit because it is set to INTRODUCING purpose where the SOCKS request makes the client look for an INTRODUCE_ACK_WAIT and if can't find any launches a new circuit and sets it to INTRODUCING.

I need to confirm that but I do think we'll lose track of this circuit until its idle timeout of 52 minutes hehe.

comment:14 Changed 4 weeks ago by dgoulet

Status: needs_informationmerge_ready

So the circuit is *not* closed but it is re-used at the next SOCKS connection. I was mistaken, the magic of circuit_is_acceptable() returns the opened INTRODUCING circuit. So at least it is re-used.

And I confirm with CrcuitsAvailableTimeout 10 and SocksTimeout 5, the circuit gets closed short after the socks timeout.

Going in merge_ready since it confirms what Mike was asking about.

comment:15 Changed 4 weeks ago by nickm

Should we consider a backport for this, or is it too minor and/or too risky?

comment:16 in reply to:  15 Changed 3 weeks ago by dgoulet

Replying to nickm:

Should we consider a backport for this, or is it too minor and/or too risky?

I'm a bit puzzled about a backport. It has some inconvenience in terms of usability but doesn't break the HS feature per-se. I think I would be more comfortable keeping that in 032 for now so at least v3 is fixed.

I wouldn't object going up to 030 with it but definitely not in 029 for now.

comment:17 Changed 3 weeks ago by nickm

Keywords: 031-backport 030-backport added
Milestone: Tor: 0.3.2.x-finalTor: 0.3.1.x-final

Okay. So, I've rebased it onto 0.2.9 (since if I understand you right, that's as far as we might ever want to backport) and squashed it as bug23681_029_01_squashed. Only taking it in 0.3.2 and later for now though. (Merging to 0.3.2 and forward!)

(Git note: From a git convenience POV, it's easiest to have a branch that's based on the earliest convenient* version that we think we might want to backport onto. That way, I can merge the branch to 0.3.2 now, and then later either merge the branch to 0.3.1 or 0.2.9 or not.)

([*] Of course, this isn't a good idea when the code has changed a lot between branches. If there are huge conflicts, it makes sense to do separate branches.)

Note: See TracTickets for help on using tickets.