Right now, NumEntryGuards tries to do some magic to set "guard-n-primary-guards-to-use" to the torrc value, and "guard-n-primary-guards" to twice that value.
This prevents us from testing our currently favored Proposal 291 params (2 for each). So the torrc option could either be changed to set both of those values to the same number, or we could add two separate torrc options.
We should also ensure that whatever we do, we have the ability to set the torrc such that we could get the same behavior as all existing clients would get, with those consensus params, if we wanted. This includes directory guards (which we currently believe will and should be the same as the two primary guards).
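For concreteness, here's a sketch of what that magic implies for a torrc (the mapping is from the description above; treating 2-and-2 as the favored Proposal 291 params is an assumption about intent):

```
# Current behavior: "NumEntryGuards N" sets
#   guard-n-primary-guards-to-use = N
#   guard-n-primary-guards        = 2 * N
NumEntryGuards 2   # -> use 2 out of 4 primary guards
# The favored Proposal 291 params instead want both values
# set to 2, which this mapping cannot express.
```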
Here's a list of things we're trying to learn from this testing:
How often does the code decide that one of these two primaries is "down" when it is not?
How often does the code prefer one guard over the other? (They should be split roughly 50/50 with this patch as-is, unless you get unlucky with path restrictions... does that happen a lot?)
How often do we decide to use guards other than our two primaries with this patch?
What circumstances cause us to use guards other than our two primaries with this patch?
Do we use the same two directory guards as our primary guards?
Do we ever have microdescriptor shortages or 503 directory busy issues with this patch?
What happens when we wander into the uncharted "sampled guard" territory of prop271?
Do our failure modes for the above/other issues ever result in complete downtime for the client? (Can we fix that easily?)
Can the client be induced to spam or otherwise thrash on its guards when it thinks one or both are down/unreachable?
How does the vanguard controller behave with this patch?
Please let me know if this behavior is not right for upstream. I thought about it for a bit, and decided that it's a reasonable behavior and probably less arbitrary than the previous one. It's always a bit unclear what users who set NumEntryGuards are thinking, so since we (the devs) are thinking of adopting this behavior, it's probably the right one.
OK, I've been running the above branch for a week or so. Here are the things I can answer already:
How often does the code decide that one of these two primaries is "down" when it is not?
This is the same behavior as single-guard, so not much has changed here.
Tor doesn't have many false positives here when the network is good, but it can have false negatives [i.e. Tor keeps thinking that a guard is up, even though that guard is overheating and sending DESTROYs to every circuit (#25347)].
When the network is bad, it's a totally different situation. See question 3.
How often does the code prefer one guard over the other? (They should be split roughly 50/50 with this patch as-is, unless you get unlucky with path restrictions... does that happen a lot?)
I think we're good here. It's a smartlist_choose() in select_entry_guard_for_circuit().
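A minimal Python sketch (a stand-in for illustration, not Tor's actual C code) of why a uniform random pick among two usable primaries comes out near 50/50:

```python
import random
from collections import Counter

def select_entry_guard(primary_guards, rng):
    """Simplified stand-in for the smartlist_choose() call in
    select_entry_guard_for_circuit(): pick uniformly at random
    among the usable primary guards."""
    return rng.choice(primary_guards)

rng = random.Random(1)  # fixed seed for reproducibility
counts = Counter(select_entry_guard(["guardA", "guardB"], rng)
                 for _ in range(10_000))
print(counts)
```

Over many circuits the two guards each get roughly half the picks; a lasting skew would point at path restrictions or a bug, not at the selection itself.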
How often do we decide to use guards other than our two primaries with this patch?
What circumstances cause us to use guards other than our two primaries with this patch?
This is related to question 1. It usually happens when the network is bad and Tor thinks that some guards are down when they are not. There are cases where Tor can end up thinking that the primaries are down, and it will use the guards below the primaries (i.e. enter the wilderness). I tested this manually using iptables and immediately found #25783. We need to fix that bug and re-test to see what else appears.
The iptables command was a simple:
iptables -A OUTPUT -d 202.54.1.22 -j DROP
where 202.54.1.22 was my top primary guard.
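To model what "entering the wilderness" looks like, here's a toy Python sketch (illustration only; real guard selection is more involved and, as noted above, picks randomly among usable primaries rather than in order):

```python
def pick_guard(guard_list, is_up):
    """Toy fallback: prefer the first two (primary) guards; if
    every primary is believed down, fall through to the remaining
    (sampled) guards in list order."""
    primaries, rest = guard_list[:2], guard_list[2:]
    for guard in primaries:
        if is_up(guard):
            return guard
    for guard in rest:          # "the wilderness"
        if is_up(guard):
            return guard
    return None                 # nothing reachable at all

guards = ["primary1", "primary2", "sampled1", "sampled2"]
# Simulate the iptables DROP: both primaries look down.
print(pick_guard(guards, lambda g: not g.startswith("primary")))
```

The interesting failure mode is exactly this transition: once the client's "is up" beliefs about both primaries go wrong, it silently moves to sampled guards.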
Do we use the same two directory guards as our primary guards?
Yes; we only use our two primary guards as directory guards.
Do we ever have microdescriptor shortages or 503 directory busy issues with this patch?
I haven't encountered such a thing yet, but I also haven't looked for it. I haven't encountered #21969 either.
What happens when we wander into the uncharted "sampled guard" territory of prop271?
Do our failure modes for the above/other issues ever result in complete downtime for the client? (Can we fix that easily?)
I haven't done enough testing here. Sometimes things work perfectly; other times Tor gets stuck for a bit, then gets unstuck and works fine. In the case of #25783, though, Tor got stuck for about 6 minutes... We really need to look into this more.
Can the client be induced to spam or otherwise thrash on its guards when it thinks one or both are down/unreachable?
This is #25347. Tor will keep trying to establish circuits to its guards even though they are overloaded and sending DESTROYs all the time. There is no single best approach to solving that problem.
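One commonly used mitigation for this kind of thrashing (just an illustration of the design space, not a claim about what Tor does or should do) is exponential backoff between retries to a guard that keeps destroying circuits:

```python
import itertools

def retry_delays(base=1.0, factor=2.0, cap=600.0):
    """One possible mitigation, sketched for illustration:
    exponentially increasing delay between reconnection attempts,
    capped so the client still retries eventually."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor

print(list(itertools.islice(retry_delays(), 6)))
# -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```

The trade-off is the usual one: back off too slowly and the client hammers an overloaded guard; back off too aggressively and a briefly-unreachable guard causes long client downtime.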
How does the vanguard controller behave with this patch?
This seems good for testing, but I don't think this is actually the right behavior if we want to make these values adjustable IRL. Instead, I think we should make separate, independent configuration options. That way, if we (or anybody else!) want to experiment with different values, we can actually experiment with the full range of possibilities.
Good point. Please check branch bug25843_v2 which introduces the NumPrimaryGuards torrc option.
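Presumably that makes configurations like the following possible (an assumption about the branch's option semantics; double-check against bug25843_v2 itself):

```
# Assumed usage with the bug25843_v2 branch:
NumEntryGuards 2     # guards actively used for circuits
NumPrimaryGuards 2   # size of the primary guard set
```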