Opened 9 years ago

Closed 9 years ago

Last modified 7 years ago

#1840 closed defect (fixed)

run_connection_housekeeping() closes circuits early

Reported by: arma Owned by:
Priority: High Milestone: Tor: 0.2.2.x-final
Component: Core Tor/Tor Version:
Severity: Keywords:
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Due perhaps to bug #1772, my circuitbuildtimeout is growing very big:

Aug 17 14:59:00.189 [notice] Based on 1000 circuit times, it looks like we need to wait longer for circuits to finish. We will now assume a circuit is too slow to use after waiting 255 seconds.

And once it's this big, run_connection_housekeeping() starts killing OR connections that haven't finished handshaking yet:

Aug 17 15:06:03.012 [info] run_connection_housekeeping(): Expiring non-used OR connection to fd 18 (38.229.70.33:9001) [idle 180].
Aug 17 15:06:03.012 [info] run_connection_housekeeping(): Expiring non-used OR connection to fd 11 (85.25.130.135:443) [idle 180].
Aug 17 15:06:03.012 [debug] conn_close_if_marked(): Cleaning up connection (fd 18).
Aug 17 15:06:03.012 [debug] circuit_n_conn_done(): or_conn to FordModelA/38.229.70.33, status=0
Aug 17 15:06:03.012 [info] circuit_n_conn_done(): or_conn failed. Closing circ.
Aug 17 15:06:03.012 [info] circuit_build_failed(): Our circuit died before the first hop with no connection
Aug 17 15:06:03.013 [debug] circuit_increment_failure_count(): n_circuit_failures now 1.
Aug 17 15:06:03.013 [debug] connection_remove(): removing socket 18 (type OR), n_conns now 20
Aug 17 15:06:03.013 [debug] _connection_free(): closing fd 18.
Aug 17 15:06:03.013 [debug] conn_close_if_marked(): Cleaning up connection (fd 11).
Aug 17 15:06:03.013 [debug] connection_remove(): removing socket 11 (type OR), n_conns now 19
Aug 17 15:06:03.013 [debug] _connection_free(): closing fd 11.

So the first issue is: why the heck is my network taking >180 seconds to TLS handshake with this relay? Probably something about my network stability, plus the results of bug #1772.

The second question is: why are we culling the OR connection when there's a circuit that isn't connected yet but wants to be? That sounds like a check we forgot to add (that didn't matter when the timeout was 15 minutes, but now matters when the timeout is 3 minutes).

Child Tickets

Change History (7)

comment:1 in reply to:  description Changed 9 years ago by mingw-san

Replying to arma:

The second question is: why are we culling the OR connection when there's a circuit that isn't connected yet but wants to be? That sounds like a check we forgot to add (that didn't matter when the timeout was 15 minutes, but now matters when the timeout is 3 minutes).

Seems like it was "Expiring non-open OR connection" actually, but (past_keepalive && !connection_state_is_open(conn)) never be true since (by default)IDLE_OR_CONN_TIMEOUT < options->KeepalivePeriod.

comment:2 Changed 9 years ago by nickm

The second case seems worth fixing, perhaps by making the check that triggers ("Expiring non-used OR connection to fd %d (%s:%d) [idle %d]") only happen on open connections, and letting the "Expiring non-open connection" test get 'em if they take far far too long to open.

comment:3 Changed 9 years ago by arma

Priority: normalmajor

Triage: this one is blocking 0.2.2.x. I'm not going to be the only one who gets a horrible circuit build timeout, and it will be a real pain to diagnose if it's only people in Nigeria who experience this bug once #1772 is fixed.

comment:4 Changed 9 years ago by nickm

Status: newneeds_review

Looks like the first commit to cause this problem was 67b38d50, which made both timeouts apply to non-open connections, when it apparently only meant to apply one of them. That makes this a backport candidate.

See branch "bug1840" in my public repo; it is against maint-0.2.1 and should apply fine against master too.

comment:5 Changed 9 years ago by arma

4d2e9974f91 looks good to me. Please merge.

As for your "XXXX why? I think we can remove this. -NM", I agree that it can be removed -- there should be nothing to flush for a non-open OR conn.

I think it doesn't need a backport to 0.2.1.x, since in 0.2.1.x CircuitBuildTimeout is 60 seconds, and this bug only triggered because my CBT grew beyond 180 seconds due to other 0.2.2.x bugs. On the other hand, what could a backport possibly break? ;)

comment:6 Changed 9 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

I think I'll merge to 0.2.1.x then; the old way of doing things -- where a non-open connection only had the first timeout apply to it -- held until 0.2.1.26, and I'm not convinced that the 0.2.1.26 change hasn't hurt us in other ways.

(Merged; closing.)

comment:7 Changed 7 years ago by nickm

Component: Tor ClientTor
Note: See TracTickets for help on using tickets.