Clients should warn and disable guards responsible for excessive circuit failures

changed milestone to %Tor: 0.2.3.x-final

added MikePerry201206 actualpoints::14 (nickm:6 mikeperry:8) component::core tor/tor milestone::Tor: 0.2.3.x-final owner::mikeperry parent::5456 points::6 priority::high resolution::implemented status::closed tor-client type::defect labels

Trac:
Parent: N/A to #5456 (moved)

This looks related to #5452 (moved).

Hrmm. We also want to record per-guard/bridge failure counts, too, so that if just one guard fails too many circuits, we warn.

I think this should be relatively easy to implement, with the one exception that we should only count failures while a functional ORConn exists to the guard/bridge, or perhaps only if the first hop succeeds (to avoid counting during network downtime).

I'm going to give it a shot next month and see what happens.

Trac:
Points: N/A to mikeperry
Keywords: N/A deleted, MikePerry201205 added

nickm+arma: I have a rough draft of a patch in mikeperry/bug5458, but I have quite a few questions. They are listed in the commit message and in the XXX's in the source, but here's a copy of my questions for easy reply on trac:

The EntryGuard state file format and parsing code is extra strength crazy sauce. Have I modified it correctly?
I think we should actually remove guards/bridges that exceed our warning threshold (there's an XXX where we should do this). The question is: how? Can I just remove them from the list? What about bridges? Will they just get added back? Can I remove them from the user's config, or is that too rude?
Did I handle bridges correctly in the first place? I think I can treat them like guards in most cases in the code, but can I really?
Do I need to do anything special for 4 hop circuits/cannibalized circuits?

Trac:
Cc: N/A to arma, nickm

Trac:
Owner: N/A to mikeperry
Points: mikeperry to 6
Status: new to assigned

I just realized I accidentally left some log messages at warn that should probably be at debug. Please ignore, I'll demote them. I'll also remove the useless space at the end of routerinfo_t.

Replying to mikeperry:

In https://lists.torproject.org/pipermail/tor-dev/2012-March/003361.html, the raccoon gave us some bounds to determine unreasonable amounts of circuit failure that indicate a potential tagging attack.

Tor Clients should emit warnings if they observe anywhere near this amount of circuit failure. His bound for tagging was 80%, but frankly if even 50% of my circuits are failing, I would want tor to tell me.

Maybe >50% failure should notice and >66% failure should warn?

Have you measured the usual percentage of circuits failing?

IIRC, someone in IRC was tracking his own circuit failure rate, and I vaguely remember 30% of his circuits failing. It might be a good idea to do some research on how and why our circuits are failing before deploying the change of this ticket.

We might also want to make the percentages configurable, so that users can tweak them depending on their paranoia.

I like the idea of making it configurable, perhaps as a consensus parameter. I'll add that in the final version. I don't think we need to hold off on exhaustive studies though. We should just pick bounds high enough that we're sure they're bad news if they actually happen. I think the best study will be just to run it for a while and see what the results look like for guards and bridges.

At a later date, we can potentially use #5459 (moved) to discover tighter consensus params if we really wish.

Also, I forgot to mention that I've tested the functionality of the patch and it does appear to at least properly accumulate stats, and properly save and restore stats.

Oooh! A patch!

Trac:
Status: assigned to needs_review

Replying to rransom:

Oooh! A patch!

Indeed! (My workflow does a very bad job of showing me tickets that are neither given a milestone nor put in needs_review. I'm throwing this into 0.2.2.x for now, but I think 0.2.3.x might be more appropriate.)

Quick notes on the code (actual stuff and cosmetic stuff all mixed together):

PATH_BIAS_* seem like they want to become configurable, or let via a consensus param, or both.
is_an_entry_guard() is misnamed: any function with a predicate name like that should return a boolean. If we need a function that gets a digest and returns an entry_guard_t *, it should have a name like entry_guard_get_by_id_digest() or something.
Also, there's no real reason to make it inline.
I would be much more comfortable if callers of is_an_entry_guard() [or whatever it gets renamed to] checked to make sure it returns non-null, rather than asserting. Like, what if the node were to stop being an entry guard for some reason after the circuit is launched?
Obviously, the log_warns need to be turned into info or debug before this goes into production.
spurious newline added to or.h.
Using u64 for the circuit count seems insanely high. If we have been using an entry guard for long enough to get anything like UINT32_MAX circuits, surely we don't want to be counting the older circuits any more. I think that the "scale everything down" feature should really kick in long before INT64_MAX. Like, it doesn't seem unreasonable to me for this to kick in even at 500 or 1000 circuits and divide both values by 2 or 5 or 10. Maybe this should be a consensus parameter too; what do you think?
The huge new blob in circuit_finish_handshake() should really get extracted into a new function for readability.
For the entry_guards_parse_state() string, wouldn't it be much easier to implement this if we just had the two values stored in separate state fields? That way we wouldn't need to have the state management functions mess around with string manipulation.
Failing that, "hop = (uint32_t)tor_parse_uint64" looks wrong, as does "success = (build_time_t)...". Also, you can't use "%ld" to snprintf an unsigned 64-bit int; look at U64_FORMAT and U64_PRINTF_ARG if you really need to do that.
For some of the stuff you're doing with base16_encode() now, you should be using hex_str(), I think.

More general notes:

I want other people's opinions here, but this doesn't feel like an 0.2.2.x change to me.
A four-hop circuit IMO doesn't need special treatment. A cannibalized circuit, though, will get counted as having succeeded twice, which isn't necessarily what we want. This could throw off results if a large enough fraction of circuits gets cannibalized. It
Is it possible to have PATH_BIAS_MIN_CIRCS on a guard all in progress at once? If so, there might be an annoying problem when starting up a new guard.
Removing stuff from the list is probably not a great idea, since it wouldn't keep the node from getting re-added. Instead, you probably want to flag the node as invalid or something.

More thought-requiring notes:

We should do the math to see how successful the attack can be under different parameter choices. (Under the current parameter choices, it seems like you can do route capture pretty darn effectively so long as you treat each new client honestly for a sufficiently large number of circuits before doing the attack.)
We should figure out what kind of false positive rate we expect, and document that, and maybe even mention it in the warning. (A tiny but not vanishing false positive rate, multiplied by a very large number of users and a lot of time, means that we should expect some number of spurious reports.)

Trac:
Milestone: N/A to Tor: 0.2.2.x-final

nickm: All of the above sounds fine. I'll try to get through all of them in the next revision. I assume you'll just want a fresh branch with a single commit?

Also, I'm still wondering how to handle removing guards and bridges (or perhaps just marking them as down/invalid/bad?) in the event they hit the 'warn' limit. The most surefire way seems to be to remove them from both bridge and guard lists, as well as remove them from the config (if present), but this seems like it might be rude. Perhaps marking them to be avoided somehow is better? Any suggestions?

Also, as a general process matter for future revisions, do you also prefer needs_review tickets to be assigned to you to notice, or is it enough for you to be in the Cc?

Replying to mikeperry:

nickm: All of the above sounds fine. I'll try to get through all of them in the next revision. I assume you'll just want a fresh branch with a single commit?

Either way is fine. For bigger stuff, I'd prefer multiple commits, possibly to be squashed, but this is a small enough patch I think I can read the new version well either way.

Also, I'm still wondering how to handle removing guards and bridges (or perhaps just marking them as down/invalid/bad?) in the event they hit the 'warn' limit. The most surefire way seems to be to remove them from both bridge and guard lists, as well as remove them from the config (if present), but this seems like it might be rude. Perhaps marking them to be avoided somehow is better? Any suggestions?

Removing them has the problem that they can just get added in. Bridges, for example, will get inserted again the next time we load the configuration. I think the right answer is marking them as to be avoided. This could be done as a new flag, or by having the code that looks at the nodes' status to find good / sufficent entries also check to see if the ratio of failed->attempted circuits is too high.

Also, as a general process matter for future revisions, do you also prefer needs_review tickets to be assigned to you to notice, or is it enough for you to be in the Cc?

I don't need either; I look at all the needs_review tickets assigned to a Tor component fairly regularly. The important fields for me are the status, the component, and the milestone.

Self: Don't forget to include a changes file, too.

Trac:
Status: needs_review to needs_revision

I haven't read the original thread, but I too am worried that we'll run into another situation where we optimize things for Mike's network connection and it produces an unending stream of issues from people who have different network connections.

Replying to nickm:

I want other people's opinions here, but this doesn't feel like an 0.2.2.x change to me.

Yeah, there's no reason a potential fix like this, especially with its high chance of unexpected side effects, should go into a solidly frozen branch like 0.2.2.

Also, I haven't looked at any of this closely, but it appears that the bug title talks about log messages, and some of the comments here talk about automatically altering client behavior?

Trac:
Summary: Clients should warn in the event of excessive circuit failures to Clients should warn and disable guards responsible for excessive circuit failures

Trac:
Description: In https://lists.torproject.org/pipermail/tor-dev/2012-March/003361.html, the raccoon gave us some bounds to determine unreasonable amounts of circuit failure that indicate a potential tagging attack.

Tor Clients should emit warnings if they observe anywhere near this amount of circuit failure. His bound for tagging was 80%, but frankly if even 50% of my circuits are failing, I would want tor to tell me.

Maybe >50% failure should notice and >66% failure should warn?

to

In https://lists.torproject.org/pipermail/tor-dev/2012-March/003361.html, the raccoon gave us some bounds to determine unreasonable amounts of circuit failure that indicate a potential tagging attack.

Tor Clients should emit warnings and disable guards if they observe anywhere near this amount of circuit failure per guard. His bound for tagging was 80%, but frankly if even 50% of my circuits are failing, I would want tor to tell me.

Maybe >50% failure should notice and >66% failure should warn?
Milestone: Tor: 0.2.2.x-final to Tor: 0.2.3.x-final

I worry about lots of stuff, too.

Replying to arma:

I haven't read the original thread, but I too am worried that we'll run into another situation where we optimize things for Mike's network connection and it produces an unending stream of issues from people who have different network connections.

As mike implies, there's not a lot we can do here: I think our best bet is to actually try this and tune as we go.

This is IMO less likely to cause trouble than the cbt stuff did, since it shouldn't be dependent on local timing or network issues. Rather, it depends on the failure rate for extending from your guards. I don't think that should be mike-dependent, but who can say.

Mike, one thing I can't remember if we talked about or not: I'm not going to insist on a full proposal here, but I would like the final version to come with a few paragraphs of explanation for path-spec.txt or somewhere else that's appropriate.

Mike says that he might not get to revising this one till June. If I find myself with free time before then, I might want to take a shot at it.

Nick: I'm putting this one on my radar for June, but if you decide you just want to bang it out, don't let that stop you :).

Trac:
Keywords: MikePerry201205 deleted, MikePerry201206 added

I pushed a "bug5458" to my repository that handles all of my issues above other than:

A four-hop circuit IMO doesn't need special treatment. A cannibalized circuit, though, will get counted as having succeeded twice, which isn't necessarily what we want. This could throw off results if a large enough fraction of circuits gets cannibalized.
Is it possible to have PATH_BIAS_MIN_CIRCS on a guard all in progress at once? If so, there might be an annoying problem when starting up a new guard.
We should do the math to see how successful the attack can be under different parameter choices. (Under the current parameter choices, it seems like you can do route capture pretty darn effectively so long as you treat each new client honestly for a sufficiently large number of circuits before doing the attack.)
We should figure out what kind of false positive rate we expect, and document that, and maybe even mention it in the warning. (A tiny but not vanishing false positive rate, multiplied by a very large number of users and a lot of time, means that we should expect some number of spurious reports.)

Totally untested! Needs review. Also:

Needs a path-spec.txt writeup.
Needs a changes file.
Needs to have the new parameters noted in dir-spec.

Also Mike Perry owes me a tasty beverage, to be added to the innumerable tally of tasty beverages that Mike and I owe one another. And if the ratio of beverages ordered to bevereges enjoyed becomes too high, we need to move to another restaurant.

Trac:
Status: needs_revision to needs_review

Is there a self-contained description somewhere of what exactly the attack is, how we solve it, and why our solution actually solves it?

arma: http://www.crhc.uiuc.edu/~nikita/papers/relmix-ccs07.pdf is the closest we've got.

Perhaps if we're really lucky, academia will produce another hot sexy new attack paper that "proves" all defenses are broken forever ;)

nick: I'll try to take a look at this as soon as I can.

mikeperry: One more thing to think about is how this interacts with cbt stuff. If my cbt is too low, will I complain that all my guards are attacking me?

(FWIW, this is important, but we cannot defer 0.2.3.x for it.)

So long as we agree that we want this merged into 0.2.3.x as soon as it's ready for merge, that's fine with me. I hopefully will be reviewing it tomorrow. I also created #6135 (moved) for tightening the parameters for it, so we can start out loose to avoid weird Guard-killing breakage due to CBT and other things.

Ok I reviewed the patch. It looks OK to me, modulo a couple remaining concerns. So, here's the action items for mike:

Add back in calls to entry_guards_changed() from my branch
Set a flag on cannibalized circuits and check it to avoid double-counting
Run the thing for a little while
Solve those two math problems if I can, or at least stochasticcally estimate them in a python snippet

Then after that, we can do the doc updates. We probably want the math before the docs, if we can manage it.

1a. For now, change the default value for the threshold at which to disable a guard to "never disable a guard".

Trac:
Status: needs_review to needs_revision

Ok, mikeperry/bug5458 should have 1, 1a, and 2 in it, as well as some check-spaces fixes.

Still haven't run it yet :).

+  if ((guard->first_hops % (unsigned)pathbias_get_min_circs(options)) == 0) {
+      entry_guards_changed();
+  }

That's a kludge, and I think it might be a needless one. We can just say entry_guards_changed(), and that will cause a state write to happen in 10 minutes or an hour, depending. Am I wrong?

Using "0" for the "never disable circuits" value seems a little worrisome to me; it relies on us using < rather than <= for when we compare the actual failure rate to the threshold. Should I worry?

Rebased onto master and squashed as bug5458_rebased. Please add future commits based on that branch.

Replying to nickm:

{{{
if ((guard->first_hops % (unsigned)pathbias_get_min_circs(options)) == 0) {
 entry_guards_changed();
} }}}
That's a kludge, and I think it might be a needless one. We can just say entry_guards_changed(), and that will cause a state write to happen in 10 minutes or an hour, depending. Am I wrong?

Yeah, I was just reluctant to cause the equivalent action of "Write your state file every 10 minutes". I guess we don't care? I'll change entry_guard_inc_first_hop_count() to call entry_guards_changed() every time.

Using "0" for the "never disable circuits" value seems a little worrisome to me; it relies on us using < rather than <= for when we compare the actual failure rate to the threshold. Should I worry?

I guess your worry is someone might change it later and leave an opportunity to break it? Let's just futureproof the comparison with a comment rather than try to support negative failure rates and have all sorts of signed vs unsigned comparison issues. Does that satisfy your concern, or is it something deeper? I don't see any opportunities for integer overflow, if that's what you mean?

The branch to look at is now mikeperry/bug5458_squashed .

Trac:
Status: needs_revision to needs_review

Merged it into 0.2.3.x Thanks!

Trac:
Resolution: N/A to implemented
Status: needs_review to closed

I spent 2 points on the rough draft back in May. I also spent an additional 6 points going back and forth with Nick. From git logs + wild guessing, it looks like Nick spent an additional 4-6 points working on this, too.

Trac:
Actualpoints: N/A to 14 (nickm:6 mikeperry:8)