Tariq's COGS paper from WPES 2012 shows that a significant component of guard churn is due to voluntary rotation, rather than actual network changes:
http://freehaven.net/anonbib/#wpes12-cogs
In short, if the target client makes sensitive connections continuously every day for months, and you (the attacker) run some fast guards, the odds get pretty good that you'll become the client's guard at some point and get to do a correlation attack.
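To make "pretty good" concrete, here's a rough back-of-envelope sketch (the numbers and the independence assumption are mine, not from the paper): if the attacker controls a fraction f of guard-weighted bandwidth and the client picks a fresh guard every R days for T days, the chance of landing on an attacker guard at least once is about 1 - (1 - f)^(T/R).

```c
/* Back-of-envelope sketch; all numbers are assumptions, not measurements.
 * Assumes each rotation picks a guard independently, weighted by bandwidth. */
#include <math.h>
#include <stdio.h>

int
main(void)
{
  double f = 0.01;   /* attacker's fraction of guard-weighted bandwidth */
  double R = 45.0;   /* rotation period in days (~1.5 months) */
  double T = 270.0;  /* nine months of continuous daily activity */
  double picks = T / R;                 /* expected number of guard selections */
  double p = 1.0 - pow(1.0 - f, picks); /* P(at least one attacker guard) */
  printf("P(attacker becomes guard at least once) = %.3f\n", p);
  return 0;
}
```

Even with only 1% of guard bandwidth, the attacker's odds compound with every rotation, which is the core of the COGS observation.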
We could argue that the "continuously every day for months" assumption is unrealistic, so in practice we don't know how bad this issue really is. But for hidden services, it could well be a realistic assumption.
There are going to be (at least) two problems with raising the guard rotation period. The first is that we unbalance the network further wrt old guards vs new guards, and I'm not sure by how much, so I'm not sure how much our bwauth measurers will have to compensate. The second (related) problem is that we'll expand the period during which new guards don't get as much load as they will eventually get. This issue already results in confused relay operators trying to shed their Guard flag so they can resume having load.
In sum, if we raise the rotation period enough that it really results in load changes, then we could have unexpected side effects like having the bwauths raise the weights of new (and thus totally unloaded) guards to huge numbers, thus ensuring that anybody who rotates a guard will basically for sure get one of these new ones.
The real plan here needs a proposal, and should be for 0.2.5 or later. I wonder if we can raise it 'some but not too much' in the 0.2.4 timeframe though?
Nick's patch raises the guard rotation period to ~9.5 months (from ~1.5 months).
If we keep giving out the Guard flag in the same way, and it remains the case that well more than half of the capacity in the network has the Guard flag (~65% on https://metrics.torproject.org/network.html#bwhist-flags), and the median byte of guard capacity has had the Guard flag for at most 4.75 out of the last 9.5 months (I just made that number up, but I bet there exist times when it's plausible), then we basically just threw away >1/3 of our total network capacity by having clients never use it when they could have. Our bwauths might try to compensate by blowing up the weights of those new nodes, but from a security perspective that's exactly what we don't want (especially if they're Exits too, since the same weight also inflates their chance of being used as an exit).
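To spell out that arithmetic (using the made-up numbers above): if ~65% of capacity carries the Guard flag and roughly half of that guard capacity is too new to have attracted its steady-state share of clients, then about 0.65 × 0.5 ≈ 0.33 of total network capacity sits idle: clients back off from using it for middle/exit positions because it's Guard-flagged, but haven't rotated onto it as a guard either.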
Our current client weighting in path selection assumes a steady-state where everybody with the Guard flag has had it long enough to attract its fair share of users. This isn't true now, but we've been doing ok pretending it is. I fear we won't be able to pretend once you need to have run your Guard for nine months before you hit steady-state.
I like the idea of putting in a parameter now, so we can teach clients to obey the parameter now, and change it later. But I think clients need to know how close to steady-state a guard is, so they can balance appropriately. Is that a new weight on the w line? Or something else?
I'm cc'ing Mike here, since he started the whole balance-by-position-in-path strategy; and Ian and Tariq, since they worked on the COGS paper; and Ralf, since he touched on this issue in his upcoming Oakland paper.
While we're planning, though: it seems that hidden services are extra vulnerable to this issue, since they don't move and since the adversary can induce them to talk. Should we disable guard rotation for hidden services? Or just crank up its rotation period a lot?
So long as hidden services aren't a big piece of network traffic, such a move shouldn't influence overall network load balancing, and should help the hidden services a lot.
These are hard questions. People already hate the fact that when their relays get the Guard flag, throughput drops off for days. I don't believe the transition takes weeks, though (probably thanks to the bwauths), but I have not studied it in detail.
One way to improve this balancing problem might be to adjust the Wxx weights such that the guard ones depend on how long you've had the Guard flag versus this rotation parameter. If we had a curve to model the migration rate, and metadata recording Guard-flag age to create points on this curve, this might not be too hard to do. I suppose a uniform migration rate might be as good an assumption as any...
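As a strawman, that scaling could be as simple as this (a sketch under the uniform-migration assumption; the function is hypothetical, not existing tor code):

```c
#include <time.h>

/* Hypothetical sketch, not existing tor code: fraction of its full guard
 * weight a relay should carry, assuming clients migrate to a new guard at
 * a uniform rate over one rotation period. */
static double
guard_weight_scale(time_t flag_age, time_t rotation_period)
{
  if (flag_age <= 0)
    return 0.0;  /* brand-new guard: no clients have migrated yet */
  if (flag_age >= rotation_period)
    return 1.0;  /* steady state: full guard weight */
  return (double)flag_age / (double)rotation_period;
}
```

The authorities would multiply a relay's Wg* weights by this factor.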
However, personally, I think that in reality clients are rotating off of their guards much more quickly than even the 1.5-month limit. At least, it felt like my Tor clients were doing that when I watched path bias counts. I think this might be the same problem you describe when talking about the age of the median byte of Guard capacity (guards may actually already be going up and down, or losing their flags, far faster than our limits). For this reason, I'm wondering whether simply changing the rotation period to 9.5 months would actually change the rotation rate in practice.
> One way to improve this balancing problem might be to adjust the Wxx weights such that the guard ones depend on how long you've had the Guard flag versus this rotation parameter. If we had a curve to model the migration rate, and metadata recording Guard-flag age to create points on this curve, this might not be too hard to do. I suppose a uniform migration rate might be as good an assumption as any...
I agree that a uniform migration rate is as good as any (I assume by migration you mean from clients with the old behavior to clients with the new behavior). But further, don't forget that another factor here is new users showing up and picking guards. I guess we could assume that those are negligible (not true but hey, maybe it's close enough).
I like the notion of changing the weights, but I feel like inflating the Bandwidth= weight is the wrong way to do it. I increasingly think we need a per-relay thing to say "how much of a guard it is".
> However, personally, I think that in reality clients are rotating off of their guards much more quickly than even the 1.5-month limit. At least, it felt like my Tor clients were doing that when I watched path bias counts. I think this might be the same problem you describe when talking about the age of the median byte of Guard capacity (guards may actually already be going up and down, or losing their flags, far faster than our limits). For this reason, I'm wondering whether simply changing the rotation period to 9.5 months would actually change the rotation rate in practice.
See Figures 4 and 5 in the COGS paper. At least during the time period of that Tor network dataset in 2011, voluntary rotation was a much bigger risk component than natural churn.
>> One way to improve this balancing problem might be to adjust the Wxx weights such that the guard ones depend on how long you've had the Guard flag versus this rotation parameter. If we had a curve to model the migration rate, and metadata recording Guard-flag age to create points on this curve, this might not be too hard to do. I suppose a uniform migration rate might be as good an assumption as any...
> I agree that a uniform migration rate is as good as any (I assume by migration you mean from clients with the old behavior to clients with the new behavior). But further, don't forget that another factor here is new users showing up and picking guards. I guess we could assume that those are negligible (not true but hey, maybe it's close enough).
Actually no, I mean the migration rate in terms of how quickly new guards can expect to accumulate their proper fraction of clients actually using them as a Guard node. The problem I'm describing is that giving new relays a Guard flag means the weights from https://gitweb.torproject.org/torspec.git/blob/master:/path-spec.txt#l206 cause fresh guards to get substantially fewer clients until people migrate. Increasing the rotation period would exacerbate this problem. Hence, we might want to use an additional computation on the Wg* and W*g weights.
In fact, there may be two rates at work here: the natural rate of migration of clients to your new Guard node, and then later, the fraction of Guard-flagged nodes that are of a certain age. Both will require some kind of annotation or record-keeping on the authority side to compute, as they are likely best represented as points along the (0, 9.5mo] domain of two different curves.
> I like the notion of changing the weights, but I feel like inflating the Bandwidth= weight is the wrong way to do it. I increasingly think we need a per-relay thing to say "how much of a guard it is".
I guess I'm not being clear. Here's an attack, if we continue having only one weight per relay in the consensus. Let's say a new mid-sized adversarial exit relay shows up. It has the Exit flag, no Guard flag, and not a very high Bandwidth= number on the w line, since it's being used as an Exit and a Middle, so it isn't super-impressive with its download speeds.
When it earns the Guard flag, clients will back off from using it as the middle hop, and partially back off from using it as the exit hop, since they assume other clients will be using it as a guard. So its usage will go way down.
In this ticket we remark several times that we hope the bwauths will then find it to be much faster, and give it a larger Bandwidth= weight, so clients will more quickly pick it as a Guard.
But as a side effect, we have just inflated the chance that clients will pick it as an Exit too, since it's the same Bandwidth= weight that tells how useful it 'should' be for either position. So an adversary can arrange to run lots of these newly-got-the-Guard-flag relays and get more than his fair share of exit traffic.
Instead I'm suggesting that we have a second weight on that w line, which shows, for each guard, how much of his steady-state client quota we think he should have by now. And then clients would use this number to treat a half-way-there guard as having more available capacity for other path positions than an all-the-way-there guard.
I agree that this complicates things. I don't see a way of doing it without having a new parameter, per relay, though.
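To illustrate the client side of that (the "guardfrac" name and the blending formula are my assumptions, purely a sketch): with a per-relay fraction g in [0,1] on the w line saying how much of its steady-state guard load a relay carries, a client could blend the Guard-flagged weight with the ordinary weight when computing the relay's effective bandwidth for a position:

```c
/* Sketch, not existing tor code: blend position weights for a relay that
 * is only partway to its steady-state guard load.  'guardfrac' would be
 * the hypothetical new per-relay value on the w line; Wg is the weight
 * applied to Guard-flagged relays for this position, Wn the weight for
 * relays without the Guard flag. */
static double
effective_bw_for_position(double bw, double guardfrac, double Wg, double Wn)
{
  return bw * (guardfrac * Wg + (1.0 - guardfrac) * Wn);
}
```

A half-way-there guard (g = 0.5) then looks half Guard-weighted and half ordinary, so clients don't back off from it as much for middle and exit positions.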
I'm not talking about the "Bandwidth=" weights. I'm talking about the flag weights. In fact, it appears to me that I am suggesting exactly the same thing you are, just using a different mechanism (one that existing clients already obey today).
Could one/both of you spell out with more exactitude what additional fix you prefer? I'll implement something if I need to, but I'd rather have somebody else figure out what to implement. Please don't leave any steps out.
Also, does any of the above militate against actually merging this patch, possibly with the default value a little lower (3 months?), and a plan to move the default value higher once we have the Guard flag/W parameters/whatever working like we'd like?
> I'm not talking about the "Bandwidth=" weights. I'm talking about the flag weights. In fact, it appears to me that I am suggesting exactly the same thing you are, just using a different mechanism (one that existing clients already obey today).
Great. What do we change in these weights then? I still don't see with these system-wide weights how we can tell clients to still back off from using a Guard-for-a-long-time relay for other path positions, but not back off so much from using a Guard-for-just-a-short-time relay for other path positions.
guards_get_lifetime()'s comment says it's about "directory guards", but I think it's about the other kinds of guards too, yes?
I believe I misremembered the code when making some of the above comments -- I now believe setting GuardLifetime to 9 months will make your guards last between 8 and 9 months (as opposed to between 9 and 10 months). Since your consensus param wants to be "minimum lifetime" (which is a fine choice), we should deal with that fencepost issue internally by adding on an extra fencepost or something.
I notice that our time_units array doesn't know what a month is. (I noticed because if we add a month and define it as 30 days, then 2 months, which is the current value, will be under your MIN_GUARD_LIFETIME value.) (That said, see the above fencepost issue -- I think the current value is actually best described as "1 month", meaning "at least 1 month", i.e. when it chooses an expiration time it chooses it up to 30 days in the past, and when it checks for expiration it checks if 60 days have passed.)
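A sketch of how the fencepost could be absorbed internally (the helper, the constant, and the 30-day month are all assumptions, not existing tor code):

```c
#include <time.h>

/* Sketch only: treat the configured lifetime as a minimum by adding one
 * randomization interval internally, so "9 months" can't yield an
 * 8-month guard.  The 30-day "month" is an assumption. */
#define GUARD_LIFETIME_SLOP (30*24*60*60)

static time_t
effective_guard_lifetime(time_t configured_minimum)
{
  return configured_minimum + GUARD_LIFETIME_SLOP;
}
```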
Your patch doesn't change the two comments in remove_obsolete_entry_guards() that say "2 months".
I'd be fine changing the value to "at least 2 months" while we're discussing how to deal with the weights issue.
>> I'm not talking about the "Bandwidth=" weights. I'm talking about the flag weights. In fact, it appears to me that I am suggesting exactly the same thing you are, just using a different mechanism (one that existing clients already obey today).
> Great. What do we change in these weights then? I still don't see with these system-wide weights how we can tell clients to still back off from using a Guard-for-a-long-time relay for other path positions, but not back off so much from using a Guard-for-just-a-short-time relay for other path positions.
Ugh, I think I have braindamage from juggling too many things. You're right, individual relays can't be re-weighted in this way currently. We would need client-side changes for what I described: we'd need to get the duration that each node has had the Guard flag to the client somehow, and then the client would have to adjust that node's Wg weights itself.
Either way, it's not something we'd do on the 0.2.4.x timescale. For now, we should avoid raising the new limit too far beyond its current value.
For the branch for 0.2.4.x: I agree we should default to 2 or 3 months instead of 9 months here. I also took a quick look at the branch, and it seems weird to clamp the torrc option silently. If we're going to alter user torrc values, it should probably happen in options_validate() with a log message. I also think 2 months is a high minimum, especially for torrc; something like "10 minutes" seems better there. For the consensus, I agree we might want to bottom out at a month.
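Roughly the options_validate() pattern I mean, sketched with assumed option and constant names:

```c
/* Sketch (names assumed): clamp loudly inside options_validate() rather
 * than silently elsewhere. */
if (options->GuardLifetime && options->GuardLifetime < MIN_GUARD_LIFETIME) {
  log_warn(LD_CONFIG, "GuardLifetime is too low; raising it to %d seconds.",
           (int)MIN_GUARD_LIFETIME);
  options->GuardLifetime = MIN_GUARD_LIFETIME;
}
```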
For 0.2.5.x when we actually change this to a larger value: I thought about the weight discussion a bit more. It of course needs a proposal to make it specific enough to implement, but I think the best option would be to create a new consensus method that allows each relay to optionally have a subset of the bandwidth-weights keyword pairs (the Wxx ones used by compute_weighted_bandwidths() and smartlist_choose_node_by_bandwidth_weights()) on its 'w' line, which would override the values from the consensus footer if present.
We would then compute these Wg weights for each relay at the authorities, depending on how old a guard the relay is, using a scaling function similar to what arma and I mentioned earlier (probably a simple linear function representing a constant rate of client arrival and migration until we hit an age greater than the rotation period).
We'd also probably want to alter the bandwidth-weight computation at the end to multiply these overrides by the relay bandwidth, as if that multiplied value were the total bandwidth for that relay for that flag. This would give us more realistic fractions of how much bandwidth is actually being used for the Guard vs other positions during any given consensus period.
Once those two changes are made, we should be free to make this value as large as we want without significantly impacting balancing, I think. We should also be able to observe in practice that getting the Guard flag no longer causes your relay to suddenly drop in traffic volume, so it will hopefully be obvious if it's actually working.
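Concretely, the authority-side computation might look something like this sketch (the function, the w-line format with a per-relay Wgg, and the linear curve are all assumptions, not an existing consensus method):

```c
#include <stdio.h>
#include <time.h>

/* Sketch, not an existing consensus method: emit a per-relay Wgg override
 * on the w line, scaled linearly by how long the relay has held the Guard
 * flag relative to the rotation period. */
static void
append_w_line(FILE *out, int bandwidth, time_t guard_age,
              time_t rotation_period, int footer_wgg)
{
  double scale = (guard_age >= rotation_period)
    ? 1.0 : (double)guard_age / (double)rotation_period;
  fprintf(out, "w Bandwidth=%d Wgg=%d\n",
          bandwidth, (int)(footer_wgg * scale));
}
```

Clients that understood the new method would presumably use the per-relay Wgg, while older clients would fall back to the footer value.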
I've made arma's changes in branch "bug8240_v2", still on 0.2.3. I'm satisfied with clamping the option silently for now, given that I added documentation. 10 minutes is insanely low; I guess we could make it possible for TestingTorNetwork purposes, but that seems like a feature, and therefore 0.2.5 stuff. Adding a lower consensus-based minimum I kinda want to lump into the same category.
Putting this back in needs_review: shall I forward-port to 0.2.4 and merge?