circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes

changed milestone to %Tor: 0.2.3.x-final

added MikePerry201208 actualpoints::13 component::core tor/tor milestone::Tor: 0.2.3.x-final owner::mikeperry points::8 priority::high resolution::fixed status::closed tor-client type::defect version::tor 0.2.3.19-rc labels

I'm seeing hundreds of these: [warn] circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes (M/m) for guard .

Hundreds of these on a normal client? Are you doing anything atypical?

For example, are you running a hidden service? Running a relay? Running a network scanner? Using a custom Tor controller?

There are a few accompanying entries shown below, but note primarily that the above style uselessly lists only the , while the below style needlessly lists it. [notice] Low circuit success rate M/n for guard =.

You get both this message and the above message for the same guard? I assume you mean the nicks match still? If so, for now you can still inspect your tor state to find the guards that have matching success rates, to verify you have two guards with the same nick but different EntryGuardPathBias counts.

What ranges of values of M, m, and n have you seen for each case?

Trac:
Status: new to assigned
Owner: mikeperry@torproject.org to mikeperry
Cc: nickm@torproject.org to nickm
Keywords: N/A deleted, MikePerry201208 added
Priority: normal to major

By tor state, I mean your state file. Where it lives varies by platform :/.

Running a network scanner?

Polling onions, parallelism = 50 or so.

[your other questions...]

See log.

Nicks? Nicks are not unique, fp's are. Either drop line printf of nicks, or print both nick and fp.

Please delete comment 3 and this line as redundant.

nickm/arma/karsten: Are either of you guys aware of ways circuit_finish_handshake() might not get called in the hidden service code upon circuit completion? That sounds like what this but is, but I'm not really clear on how hidden service circuit construction differs from normal circuit construction. I assumed the codepaths were the same.

It's either that, or we missed a case where circuits can be counted for success twice in circuit_send_next_onion_skin().

If we can't find codepaths for either of these, I guess the next question is: Can we repro without hidden service use involved?

Trac:
Cc: nickm to nickm, armadev, karsten, ioerror

circuit_finish_handshake() is called on every EXTENDED cell, and every CREATED cell we get from a first hop. That's it. ("git grep circuit_finish_handshake".)

The thing to look for instead would be calls to circuit_send_next_onion_skin, which is the function that actually increments the value in question. It is unfortunately called from ALL OVER the codebase.

The network scanner is the thing I'm interested in, not the hidden services. Check out handle_control_extendcircuit -- it calls circuit_send_next_onion_skin. Could controller-driven circuit creation be at issue here?

if you're asking of my controller driving... none, all natural here. just parallel HS http fetches across onions. Plus: maxclientcircuitspending 1024 trackhostexits . trackhostexitsexpire 604800 usemicrodescriptors 0

Could circuit cannibalization be involved here? (Add “[circ]info” to your torrc's Log line next to “notice” for the relevant messages.)

So far there are a handful of circuit_launch_by_extend_info(): Cannibalizing circ for purposes 5, 6, 9. So far none that mention a named guard from the high success entries. Each run is obviously going to be different, and this one isn't complete. Actually, right now, I've only got one guard as high success, and it's not listed in the statefile either. I don't know much about what data you're looking for, so you might be best to compile a list of 300-400+ onions, then fire off 25-50 front page scrapes in parallel till the list is done. That way you can see what's going on, locally.

[warn] circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes (144/132) for guard XXXXX

Hundreds of these on a normal client? Are you doing anything atypical?

grep "Unexpectedly high circuit_successes" tor.log |wc -l

8090

Thousands of these. And I will not admit to be doing anything. I'm not even using Tor.

Trac:
Username: oyvinds

mentioning this here so I don't forget:


18:45 < nickm> so the kludgey fix I thought is this:
18:46 < nickm> add a flag to origin_circuit_t
18:46 < nickm> That flag is: "Have we counted this circuit's first hop?"
18:46 < nickm> If it's set to 1, we do not count the first hop.e
18:46 < nickm> when we count the first hop, we set it to 1.
18:46 < nickm> if it's set to 0, we do not count the circuit as having 
               succeeded.
18:47 < nickm> when we count the circuit as having succeeded, we set it to 0
18:47 < nickm> plus side: no investigation would be needed
18:47 < nickm> minus side: we would never understand why this bug happened!
18:48 < nickm> plus side: future bugs of this kind couldn't occur
18:48 < nickm> plus side: if we added debugging code when we tried to count a 
               circuit as successful with the flag set to 0, then we might 
               track down where this bug was.

Ok. Here's some initial analysis:

There may be a few codepaths that allow circuit_send_next_onion_skin() to get called multiple times for the same circuit with a NULL onion_next_hop_in_cpath() value (which is the signal we use to determine the circuit is done). In particular the control port can do this for sure. rend_client_reextend_intro_circuit() and rend_client_receive_rendezvous() seem also able to do it. Possibly .exit notation, too.
However, there is no codepath in circuit_send_next_onion_skin() that allows us to count a successful circuit without also setting circ->has_opened = 1. There are also no codepaths that unset circ->has_opened. Thus, double-counting on the success side should be impossible, even in the above cases for point 1.
Therefore, this is an issue with the first_hop counting logic. Somehow circuits are completing a first hop without getting counted by entry_guard_inc_first_hop_count().
There appear to be no codepaths that allow us to finish a hop in a circuit without going through circuit_finish_handshake(), though this is a bit trickier to determine for sure.

All of this together seems to point to an issue with the checks in circuit_finish_handshake() itself.

As a kludge and to protect against future bugs, nickm suggested creating a new field of origin_circuit_t that we set inside entry_guard_inc_first_hop_count() and then check for in circuit_send_next_onion_skin(), so that we are absolutely sure we only count successes for circuits whose first hops we counted. We could also update this field to a third value when we count the success, to avoid potentially insane cases where we again try to count a new first hop for a circuit after counting the success for it.

All of these unexpected cases should have logging for them.

Also, this might mean that whatever hidden service circumstances that are triggering this bug won't be protected by the Path Bias defense until we figure out what the hell codepaths are involved. Hopefully the log lines + a repro script will help us find them.

I'll get to work on a patch for this as soon as I can.

grarpamp: Oh btw, if you'd like us to delete all of your above log data, I think we can do that. I'm not sure that I have the trac permissions to delete comments myself, but I can also find someone who does.

Ok. There is an experimental patch in mikeperry/bug6475. I haven't tested it yet. I'll see if I can hack up a wget loop to let it run over night and see what happens.

I appear to have successfully built 100 circuits without hitting any of those log lines. We'll see what happens with a hidden service wget loop.

Looking at the code:

These enum values need good documentation. There have been SO MANY cases where good code turned into a festering morass because we had enum values for states where we didn't clearly document what those values were.
My inner bit-miser wants to have that enum be a bitfield.
The blocks of code for this in circuit_finish_handshake and circuit_send_next_onion_skin should be extracted into a new function in a separate commit.

Other than that, seems plausible.

Trac:
Status: assigned to needs_revision

oh yeah; "Needs a changes file"

nickm: In circuit_finish_handshake() there are log messages and side effects (the 'hop' assignment) outside of the main codepath. Should I just duplicate the checks in the new function?

Also, this has been running all night fetching hidserv urls and I haven't reproduced or even hit any of the log messages. My tor client has built about 170 circuits of hidserv fetches since I started last night... I am just doing a few wgets in a shell loop.

grarpamp: Can you perhaps try reproducing with my branch? Or would you like me to directly attach a diff after addressing nick's comments?

Replying to mikeperry:

nickm: In circuit_finish_handshake() there are log messages and side effects (the 'hop' assignment) outside of the main codepath. Should I just duplicate the checks in the new function?

I think that duplicating the checks in the new function and having the hop assignment be the only thing thing that happens in circuit_finish_handshake()'s if/else blocks would actually make the code here much clearer.

circuit_send_next_onion_skin(): Bug: Unexpectedly high circuit_successes

Child items ...

Activity

grep "Unexpectedly high circuit_successes" tor.log |wc -l