Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#9776 closed defect (fixed)

Long-running tor client doesn't build new circuits anymore

Reported by: karsten Owned by:
Priority: Medium Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.4.17-rc
Severity: Keywords: tor-client
Cc: andrea Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

I'm running three tor 0.2.4.17-rc clients on ferrinii for Torperf, and one of them stopped building new circuits a while ago:

Sep 19 08:28:42.000 [notice] We'd like to launch a circuit to handle a connection, but we already have 33 general-purpose client circuits pending. Waiting until some finish. [1 similar message(s) suppressed in last 600 seconds]

The process is still running and I can debug it (though I should kill and restart it at some point to make Torperf work again). I sent it -USR1 and -USR2 signals to dump stats and switch to debug log level. I'm attaching the logs since last restarting it.

Here's the torrc:

DataDirectory .
SocksPort 9021
UseEntryGuards 0
RunAsDaemon 1
Log notice file log
ControlPort 10021
CookieAuthentication 1
PidFile tor.pid

Roger suspects it has to do with 0.2.4.x using guards as directory guards and this tor not using guards.

Child Tickets

Attachments (2)

log.gz (9.6 KB) - added by karsten 6 years ago.
Log file
log.2.gz (138.1 KB) - added by karsten 6 years ago.
New log file

Download all attachments as: .zip

Change History (19)

Changed 6 years ago by karsten

Attachment: log.gz added

Log file

comment:1 Changed 6 years ago by arma

Alas, I don't actually think it has to do with the guards.

(My initial thought was "hey, I wonder if this client's guards are overloaded, because remember 0.2.4 uses its guards for dir info too". But that doesn't apply here. But that said, roughly nobody has tried 0.2.4, directory guards, and useentryguards 0, so who knows.)

comment:2 Changed 6 years ago by mikeperry

19:09 < skruffy> found reason of #9776. it happens if

run_connection_housekeeping Expiring non-open OR connection,
using connection_or_close_normally() so channel_closed() can't
to call circuit_n_chan_done(chan, 0).

comment:3 Changed 6 years ago by karsten

19:11:30 < skruffy> so circuits on pending list can't be close till the same

connection tried, but in most cases it will end with
unresolvable state from bug.

19:31:03 < skruffy> all logic of run_connection_housekeeping need to repair,

it's incompatible with chans.

comment:4 Changed 6 years ago by nickm

Cc: andrea added

Changed 6 years ago by karsten

Attachment: log.2.gz added

New log file

comment:5 Changed 6 years ago by karsten

Attached new log file.

comment:6 Changed 6 years ago by andrea

The bug here is that channel_closed() doesn't call circuit_n_chan_done() unless chan->reason_for_closing == CHANNEL_CLOSE_FOR_ERROR, so in a few cases where a channel can close from below without error like timeouts from run_connection_housekeeping() it can leak pending circuits. If skruffy means that (rather than implying that channel_closed() doesn't run at all in that case), then his analysis is correct.

I think his comment that run_connection_housekeeping() is 'incompatible with chans' is not well supported though; this seems to be a localized bug in channel_closed(). Since it looks harmless to run circuit_n_chan_done() in any case, I suggest simply removing the chan->reason_for_closing == CHANNEL_CLOSE_FOR_ERROR test.

comment:7 Changed 6 years ago by nickm

Okay. Write a patch, and walk me though why it should work, and I'll merge it?

comment:8 Changed 6 years ago by andrea

Also, SIGUSR1 should dump stats on circuits as well as connections and channels; if it did, seeing circuits with n_chan pointers that aren't on the channel list would confirm this hypothesis.

comment:9 in reply to:  8 Changed 6 years ago by nickm

Replying to andrea:

Also, SIGUSR1 should dump stats on circuits as well as connections and channels; if it did, seeing circuits with n_chan pointers that aren't on the channel list would confirm this hypothesis.

That sounds plausible for 0.2.5.

comment:10 Changed 6 years ago by andrea

Status: newneeds_review

Proposed fix in my bug9776 branch

comment:11 in reply to:  10 Changed 6 years ago by karsten

Replying to andrea:

Proposed fix in my bug9776 branch

I'm running your branch on ferrinii now. If it doesn't fix the problem, I should learn in a couple of days. (Not at all saying this shouldn't be merged now. I can reopen this ticket if I run into the problem again.) Thanks!

comment:12 Changed 6 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Cherry-pickd to maint-0.2.4, and merged forwards. Thanks! (Please reopen if this wasn't the fix.)

comment:13 Changed 6 years ago by arma

Resolution: fixed
Status: closedreopened
<skruffy> fix for #9776 is wrong. you can't just drop circuits if chan was
already opened. if circuit will chose new chan while old still open then if
all closed before new esteblished then all circuits will be destroyed. such
case can happens for relay that process extend reqs.

<skruffy> it's racy condition for circ that didn't attached to already open
conn for some reason.
<skruffy> it's easy to destroy circuit just after extend req if no good chan.
<skruffy> so no races and less code.
<skruffy> client can ask again if need.
> you have pseudocode or something to make the suggestion clearer?
<skruffy> no, it's bad move too.

> what is special about a channel that makes us unable to do what we used to
do, back when we called them connections? or were we broken then too?
<skruffy> if channel was open no need to destroy every circ that waits for the same fpr.

<skruffy> channel_closed() need some checks before circuit_n_chan_done(chan, 0) calling.
<skruffy> check of state.

> so, that's the right fix? :)
<skruffy> like, if (chan->state =! CHANNEL_STATE_OPEN) circuit_n_chan_done(chan, 0);
<skruffy> but I'm not sure
<skruffy> with zillion new states

comment:14 Changed 6 years ago by nickm

skruffy says (if I understand correctly) that the race condition he describes above is another bug, and this can be closed.

comment:15 in reply to:  14 Changed 6 years ago by andrea

Replying to nickm:

skruffy says (if I understand correctly) that the race condition he describes above is another bug, and this can be closed.

Well, what he described there is non-change, since if channel_closed() gets called with chan->state == CHANNEL_STATE_OPEN it'll assert before it ever gets to circuit_n_chan_done() and if it's ever possible for that to happen lots of other stuff is broken too.

I'm not sure what else he's getting at, though. Possibly suggesting there can be pending circuits on the channel trying to do something on a *different* channel?

comment:16 Changed 6 years ago by cypherpunks

Resolution: fixed
Status: reopenedclosed

comment:17 Changed 6 years ago by nickm

The other bug is now #9880

Note: See TracTickets for help on using tickets.