Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#6475 closed defect (fixed)

circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes

Reported by: grarpamp Owned by: mikeperry
Priority: High Milestone: Tor: 0.2.3.x-final
Component: Core Tor/Tor Version: Tor: 0.2.3.19-rc
Severity: Keywords: MikePerry201208 tor-client
Cc: nickm, armadev, karsten, ioerror Actual Points: 13
Parent ID: Points: 8
Reviewer: Sponsor:

Description

I'm seeing hundreds of these. The current M seems to exceed the m,
for which m is increased, then M exceeds that... repeat.

ver: maint-0.2.3.x as of jul 17 2012.

[warn] circuit_send_next_onion_skin(): Bug: Unexpectedly high
circuit_successes (M/m) for guard <nick>.

There are a few accompanying entries shown below.
But note primarily that the above style uselessly lists only the <nick>,
while the below style needlessly lists it.

[notice] Low circuit success rate M/m for guard <nick>=<fp>.

Child Tickets

Change History (42)

comment:1 Changed 7 years ago by mikeperry

Cc: nickm added; nickm@… removed
Keywords: MikePerry201208 added
Owner: changed from mikeperry@… to mikeperry
Priority: normalmajor
Status: newassigned

I'm seeing hundreds of these:
[warn] circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes (M/m) for guard <nick>.

Hundreds of these on a normal client? Are you doing anything atypical?

For example, are you running a hidden service? Running a relay? Running a network scanner? Using a custom Tor controller?

There are a few accompanying entries shown below, but note primarily that the above style uselessly lists only the <nick>, while the below style needlessly lists it.
[notice] Low circuit success rate M/n for guard <nick>=<fp>.

You get both this message and the above message for the same guard? I assume you mean the nicks match still? If so, for now you can still inspect your tor state to find the guards that have matching success rates, to verify you have two guards with the same nick but different EntryGuardPathBias counts.

What ranges of values of M, m, and n have you seen for each case?

comment:2 Changed 7 years ago by mikeperry

By tor state, I mean your state file. Where it lives varies by platform :/.

comment:4 Changed 7 years ago by grarpamp

Running a network scanner?

Polling onions, parallelism = 50 or so.

[your other questions...]

See log.

Nicks? Nicks are not unique, fp's are. Either drop line
printf of nicks, or print both nick and fp.

Please delete comment 3 and this line as redundant.

comment:5 Changed 7 years ago by mikeperry

Cc: armadev karsten ioerror added

nickm/arma/karsten: Are either of you guys aware of ways circuit_finish_handshake() might not get called in the hidden service code upon circuit completion? That sounds like what this but is, but I'm not really clear on how hidden service circuit construction differs from normal circuit construction. I assumed the codepaths were the same.

It's either that, or we missed a case where circuits can be counted for success twice in circuit_send_next_onion_skin().

If we can't find codepaths for either of these, I guess the next question is: Can we repro without hidden service use involved?

comment:6 Changed 7 years ago by nickm

circuit_finish_handshake() is called on every EXTENDED cell, and every CREATED cell we get from a first hop. That's it. ("git grep circuit_finish_handshake".)

The thing to look for instead would be calls to circuit_send_next_onion_skin, which is the function that actually increments the value in question. It is unfortunately called from ALL OVER the codebase.

The network scanner is the thing I'm interested in, not the hidden services. Check out handle_control_extendcircuit -- it calls circuit_send_next_onion_skin. Could controller-driven circuit creation be at issue here?

comment:7 Changed 7 years ago by grarpamp

if you're asking of my controller driving... none, all natural here. just parallel HS http fetches across onions. Plus:
maxclientcircuitspending 1024
trackhostexits .
trackhostexitsexpire 604800
usemicrodescriptors 0

comment:8 Changed 7 years ago by rransom

Could circuit cannibalization be involved here? (Add “[circ]info” to your torrc's Log line next to “notice” for the relevant messages.)

comment:9 Changed 7 years ago by grarpamp

So far there are a handful of circuit_launch_by_extend_info(): Cannibalizing circ for purposes 5, 6, 9.
So far none that mention a named guard from the high success entries. Each run is obviously going to be different, and this one isn't complete. Actually, right now, I've only got one guard as high success, and it's not listed in the statefile either.
I don't know much about what data you're looking for, so you might be best to compile a list of 300-400+ onions, then fire off 25-50 front page scrapes in parallel till the list is done. That way you can see what's going on, locally.

comment:10 Changed 7 years ago by oyvinds

[warn] circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes (144/132) for guard XXXXX

Hundreds of these on a normal client? Are you doing anything atypical?

# grep "Unexpectedly high circuit_successes" tor.log |wc -l
8090

Thousands of these. And I will not admit to be doing anything. I'm not even using Tor.

comment:11 Changed 7 years ago by nickm

mentioning this here so I don't forget:

18:45 < nickm> so the kludgey fix I thought is this:
18:46 < nickm> add a flag to origin_circuit_t
18:46 < nickm> That flag is: "Have we counted this circuit's first hop?"
18:46 < nickm> If it's set to 1, we do not count the first hop.e
18:46 < nickm> when we count the first hop, we set it to 1.
18:46 < nickm> if it's set to 0, we do not count the circuit as having 
               succeeded.
18:47 < nickm> when we count the circuit as having succeeded, we set it to 0
18:47 < nickm> plus side: no investigation would be needed
18:47 < nickm> minus side: we would never understand why this bug happened!
18:48 < nickm> plus side: future bugs of this kind couldn't occur
18:48 < nickm> plus side: if we added debugging code when we tried to count a 
               circuit as successful with the flag set to 0, then we might 
               track down where this bug was.

comment:12 Changed 7 years ago by mikeperry

Ok. Here's some initial analysis:

  1. There may be a few codepaths that allow circuit_send_next_onion_skin() to get called multiple times for the same circuit with a NULL onion_next_hop_in_cpath() value (which is the signal we use to determine the circuit is done). In particular the control port can do this for sure. rend_client_reextend_intro_circuit() and rend_client_receive_rendezvous() seem also able to do it. Possibly .exit notation, too.
  1. *However*, there is *no* codepath in circuit_send_next_onion_skin() that allows us to count a successful circuit without also setting circ->has_opened = 1. There are also no codepaths that *unset* circ->has_opened. Thus, double-counting on the success side should be impossible, even in the above cases for point 1.
  1. Therefore, this is an issue with the first_hop counting logic. Somehow circuits are completing a first hop without getting counted by entry_guard_inc_first_hop_count().
  1. There appear to be no codepaths that allow us to finish a hop in a circuit without going through circuit_finish_handshake(), though this is a bit trickier to determine for sure.

All of this together seems to point to an issue with the checks in circuit_finish_handshake() itself.

As a kludge and to protect against future bugs, nickm suggested creating a new field of origin_circuit_t that we set inside entry_guard_inc_first_hop_count() and then check for in circuit_send_next_onion_skin(), so that we are absolutely sure we only count successes for circuits whose first hops we counted. We could also update this field to a third value when we count the success, to avoid potentially insane cases where we *again* try to count a new first hop for a circuit after counting the success for it.

All of these unexpected cases should have logging for them.

Also, this might mean that whatever hidden service circumstances that are triggering this bug won't be protected by the Path Bias defense until we figure out what the hell codepaths are involved. Hopefully the log lines + a repro script will help us find them.

I'll get to work on a patch for this as soon as I can.

comment:13 Changed 7 years ago by mikeperry

grarpamp: Oh btw, if you'd like us to delete all of your above log data, I think we can do that. I'm not sure that I have the trac permissions to delete comments myself, but I can also find someone who does.

comment:14 Changed 7 years ago by mikeperry

Ok. There is an experimental patch in mikeperry/bug6475. I haven't tested it yet. I'll see if I can hack up a wget loop to let it run over night and see what happens.

comment:15 Changed 7 years ago by mikeperry

I appear to have successfully built 100 circuits without hitting any of those log lines. We'll see what happens with a hidden service wget loop.

comment:16 Changed 7 years ago by nickm

Status: assignedneeds_revision

Looking at the code:

  • These enum values need good documentation. There have been SO MANY cases where good code turned into a festering morass because we had enum values for states where we didn't clearly document what those values were.
  • My inner bit-miser wants to have that enum be a bitfield.
  • The blocks of code for this in circuit_finish_handshake and circuit_send_next_onion_skin should be extracted into a new function in a separate commit.

Other than that, seems plausible.

comment:17 Changed 7 years ago by nickm

oh yeah; "Needs a changes file"

comment:18 Changed 7 years ago by mikeperry

nickm: In circuit_finish_handshake() there are log messages and side effects (the 'hop' assignment) outside of the main codepath. Should I just duplicate the checks in the new function?

Also, this has been running all night fetching hidserv urls and I haven't reproduced or even hit any of the log messages. My tor client has built about 170 circuits of hidserv fetches since I started last night... I am just doing a few wgets in a shell loop.

grarpamp: Can you perhaps try reproducing with my branch? Or would you like me to directly attach a diff after addressing nick's comments?

comment:19 in reply to:  18 Changed 7 years ago by nickm

Replying to mikeperry:

nickm: In circuit_finish_handshake() there are log messages and side effects (the 'hop' assignment) outside of the main codepath. Should I just duplicate the checks in the new function?

I think that duplicating the checks in the new function and having the hop assignment be the only thing thing that happens in circuit_finish_handshake()'s if/else blocks would actually make the code here much clearer.

comment:20 Changed 7 years ago by mikeperry

nickm: I believe I have addressed all of your concerns in the commit I just pushed to mikeperry/bug6475. I also increased the log messages to notice-level and rate limited them.

grarpamp+others: I am still unable to reproduce. Let me know if you would like a patch to see if this actually prevents the warns.

comment:21 Changed 7 years ago by mikeperry

Status: needs_revisionneeds_review

comment:22 Changed 7 years ago by nickm

Resolution: fixed
Status: needs_reviewclosed

Merged; thanks

comment:23 Changed 7 years ago by mikeperry

Actual Points: 8
Points: 8

comment:24 Changed 7 years ago by mikeperry

Resolution: fixed
Status: closedreopened

Ok, we're hitting the debugging notice lines in the field. We've got two sets: One is related to config->UseEntryGuards == 0 (we should just disable pathbias accounting for this case), and the other is related to hidserv code:

Aug 20 19:10:09.000 [notice] pathbias_count_success(): Bug: Opened circuit is in strange path state new. Circuit is a Hidden service: Connecting to rendezvous point currently open. [9 similar message(s) suppressed in last 600 seconds]

comment:25 Changed 7 years ago by mikeperry

For reference, that log message corresponds to circ->has_opened == 1, PATH_STATE_NEW_CIRC, CIRCUIT_PURPOSE_S_CONNECT_REND and CIRCUIT_STATE_OPEN. Fun combo. This might be our real culprit for the original bug.

comment:26 Changed 7 years ago by mikeperry

Here's another one:

Aug 17 20:13:28.000 [notice] pathbias_count_first_hop(): Bug: A new circuit is in cpath state 2 (opened: 0). Circuit is a General-purpose client currently doing handshakes. [45 similar message(s) suppressed in last 600 seconds]

This one is even weirder. It's not clear what the hell is going on with that circuit.

comment:27 Changed 7 years ago by arma

Aug 17 16:54:45.000 [notice] pathbias_count_first_hop(): Bug: Unopened circuit has no known guard. Circuit is a Hidden service: Establishing introduction point currently doing handshakes.

sure looks funky.

comment:28 in reply to:  24 Changed 7 years ago by mikeperry

Replying to mikeperry:

Ok, we're hitting the debugging notice lines in the field. We've got two sets: One is related to config->UseEntryGuards == 0 (we should just disable pathbias accounting for this case), and the other is related to hidserv code:

Aug 20 19:10:09.000 [notice] pathbias_count_success(): Bug: Opened circuit is in strange path state new. Circuit is a Hidden service: Connecting to rendezvous point currently open. [9 similar message(s) suppressed in last 600 seconds]

I think the culprit for this one is rend_service_introduce(). It looks like it can launch CIRCUIT_PURPOSE_S_CONNECT_REND circuits and call circuit_init_cpath_crypto() without ever going through circuit_finish_handshake() to do so. Thus its first hops don't get counted. This could very well be the cause of our original warn bug.

comment:29 Changed 7 years ago by mikeperry

Actual Points: 811

I pushed a branch to mikeperry/bug6475-noguards to kill the UseEntryGuard 0 case. I will see about setting up my own hidden service to reproduce the message armadev hit.

comment:30 Changed 7 years ago by mikeperry

My relays see these messages:

Aug 20 19:39:27.000 [notice] pathbias_count_first_hop(): Bug: Unopened circuit has no known guard. Circuit is a Testing circuit currently doing handshakes.
Aug 20 19:39:28.000 [notice] pathbias_count_success(): Bug: Completed circuit has no known guard. Circuit is a Testing circuit currently open.

From choose_good_entry_server(), it looks like testing circuits never use your guard nodes. I pushed an explicit check for them to mikeperry/bug6475-noguards.

comment:31 Changed 7 years ago by mikeperry

Just pushed a commit to cover the case where a testing circuit without a guard got converted to a CBT measurement circ due to timeout:

Aug 20 19:40:57.000 [notice] pathbias_count_success(): Bug: Completed circuit has no known guard. Circuit is a Measuring circuit timeout currently open.

comment:32 Changed 7 years ago by nickm

Status: reopenedneeds_revision

Changes file needed, as always.

For the hidden service case, the easiest fix (for 0.2.3) would be to just disable the warning message, and look for a more complex fix in 0.2.4.

comment:33 in reply to:  32 ; Changed 7 years ago by mikeperry

Replying to nickm:

Changes file needed, as always.

For the hidden service case, the easiest fix (for 0.2.3) would be to just disable the warning message, and look for a more complex fix in 0.2.4.

The warning message should have stopped appearing by virtue of the actual state variable fix. We're just using the notices to narrow in on weird circuit construction codepaths now.

If you want to make these LD_BUG notice lines info-level for 0.2.3-stable, we can do that I guess. I'd like to keep them notice level for 0.2.4 though, so at least *somebody* sees them. They are potentially security holes/gaps in the defense for hidden service users, were we to actually deploy it.

comment:34 in reply to:  33 Changed 7 years ago by nickm

Replying to mikeperry:

Replying to nickm:

Changes file needed, as always.

For the hidden service case, the easiest fix (for 0.2.3) would be to just disable the warning message, and look for a more complex fix in 0.2.4.

The warning message should have stopped appearing by virtue of the actual state variable fix. We're just using the notices to narrow in on weird circuit construction codepaths now.

If you want to make these LD_BUG notice lines info-level for 0.2.3-stable, we can do that I guess. I'd like to keep them notice level for 0.2.4 though, so at least *somebody* sees them. They are potentially security holes/gaps in the defense for hidden service users, were we to actually deploy it.

Sounds good; let's do it that way.

comment:35 Changed 7 years ago by mikeperry

Actual Points: 1113

See 8384e94 in mikeperry/bug6647. Should I rebase everything in there into one commit for the debugging+log stuff for #6475 and a second commit for #6647? We can then merge that, and then make an additional, separate commit to master for bumping the logs back up for 0.2.4.x.

Also, FWIW, I'm also pummeling this on a test hidden service. It has completed ~8000 hidden service circuits without hitting any of the log messages...

comment:36 Changed 7 years ago by akrey

Got two of those, too. Tor v0.2.3.20-rc (git-24d7a06f04d701c4) running on Darwin; as downloaded.

circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes (4/3) for guard <removed>
circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes (5/4) for guard <removed>

Regular client node, quite possibly during access to a single hidden service. No other log messages.

comment:37 Changed 7 years ago by nickm

I'm planning to merge branch "nickm/disable_pathbias_warnings" into 0.2.3, and try Mike's stuff in 0.2.4.

comment:38 Changed 7 years ago by nickm

Just merged nickm/disable_pathbias_warnings_v2 into 0.2.3 only.

comment:39 Changed 7 years ago by nickm

Resolution: fixed
Status: needs_revisionclosed

Merged bug6647 into master; closing this ticket.

comment:40 Changed 7 years ago by grarpamp

Yes, with 0.2.3.20, and default maxclientcircuitspending, there are:

  • Low circuit success rate
  • circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes

With maint-0.2.3 4bd90e20b92fbe2b93fc53e24254f00e663d057e there is:

  • None of this so far :)

I'm doing 1 fetch per onion, with 25 onions open at once, till queued input onion list exhausts. Doubled up. So about 50. This uses 80-90% of my CPU for tor, I'll scale up later. I'm unqualified to comment on any of the above work/plans. Sorry about the reply lag.

comment:41 Changed 7 years ago by grarpamp

mp: if i'm still needed to test a branch, ping me. though i presume no longer so.

comment:42 Changed 7 years ago by nickm

Keywords: tor-client added

comment:43 Changed 7 years ago by nickm

Component: Tor ClientTor
Note: See TracTickets for help on using tickets.