Bugs like #12466 (moved) and #12450 (moved) have made it clear that the data structures and methods of the guard nodes code are not very robust. To fix those tickets with the current data structures, we would need even more kludges that would lower the code quality.
This is a ticket about finding a more elegant approach to keeping entry guards and directory guards.
I'm putting this in 0.2.6 milestone because it's important, but it might get deferred if it's too much work.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
One of the possible designs discussed during the dev meeting, is to have two guard lists. One list is the entry_guards list, and the other is the directory_guards list.
The idea is that if your primary entry guard is not a directory cache, then you use the directory_guards list to get a directory guard. Or maybe you just use the directory_guards list for all your directory needs anyway. Also, in the future when #12597 (moved) becomes totally deployed, we will be able to move to a single list, where all nodes will be both directory guards and entry guards.
Some more ideas that were thrown around during the dev meeting, and would make guard selection smoother:
All guards should be directory caches which will happen as part of #12597 (moved).
All guards should be considered fast/stable, orRelays can be guards iff they are fast/stable. This will make it more likely that guards on the top of our list can be used for our purposes, since we will not have to skip guards till we find a stable guard for stable circuits. However, because of the analysis in #12204 (moved), this is not very likely to happen anyway.
Also, it's worth noting that designing data structures with the assumption that we are only going to have 1 primary guard is probably easier than designing data structures that can accommodate N primary guards.
One of the possible designs discussed during the dev meeting, is to have two guard lists. One list is the entry_guards list, and the other is the directory_guards list.
Seems plausible to me.
Some more ideas that were thrown around during the dev meeting, and would make guard selection smoother:
All guards should be directory caches which will happen as part of #12597 (moved).
Remember that we can't rely on "All guards are directory caches" until all server versions before 0.2.6 are obsolete.
All guards should be considered fast/stable, orRelays can be guards iff they are fast/stable. This will make it more likely that guards on the top of our list can be used for our purposes, since we will not have to skip guards till we find a stable guard for stable circuits. However, because of the analysis in #12204 (moved), this is not very likely to happen anyway.
"All guards should be considered fast/stable" seems okay to me, unless I'm missing something.
Also, it's worth noting that designing data structures with the assumption that we are only going to have 1 primary guard is probably easier than designing data structures that can accommodate N primary guards.
Hmmm. No, I don't agree with that one. We need to have some notion of what we're going to do when we move back and forth between versions of Tor that use the old state file and the new one. We should also know what to do when our main guard is down for a while, then comes back up. I think that the data structure we need to handle both of those problems will be not too far from the one that we would need to keep support for N guards.
Since we want our circuit guards to also be our directory guards (if
possible), we should probably use a single entry guard list (like we
do currently), instead of using separate lists for circuit guards
and directory guards. Reasons for this can be found here:
https://lists.torproject.org/pipermail/tor-dev/2014-May/006824.html
I wonder if this behavior is any different from the corresponding
behavior of circuit guards: "If our circuit guard fails our circuit,
we have to go ahead and ask the next circuit guard"
If not, maybe we could just switch NumDirectoryGuards to 1 too, and
just make sure that if a microdescriptor gets denied we move to the
next directory guard, till we have enough microdescriptor to be happy?
(logic similar to compute_frac_paths_available()).
With the current code, and if NumDirectoryGuards was 3, from the
first entry guard list we would select a directory guard between
(entry2, entry4, entry6). On the second list, we would select
between (entry1, entry4, entry6). On the third list, the worst case
scenario, we would select between (entry4, entry5, ...).
I'd argue that we should strongly prefer the top directory guard
every time, and only move to the lower ones if the top one doesn't
give us what we want.
With the current code, and if NumDirectoryGuards was 3, from the
first entry guard list we would select a directory guard between
(entry2, entry4, entry6). On the second list, we would select
between (entry1, entry4, entry6). On the third list, the worst case
scenario, we would select between (entry4, entry5, ...).
I'd argue that we should strongly prefer the top directory guard
every time, and only move to the lower ones if the top one doesn't
give us what we want.
My point here was that NumEntryGuards and NumDirectoryGuards probably does not work the way people expect it to work.
In the current code, NumDirectoryGuards (and NumEntryGuards) being 3 means that Tor needs to have 3 choices when picking a entry/directory guard. This means, that Tor will go as deep as needed in the guard list till it gets 3 nodes that satisfy the circuit requirements. This means that Tor might pick between the first, the second and the 25th entry guard, if the circuit requirements are such. This is what causes #12466 (moved).
OTOH, when I think of "Tor has 3 entry guards", I'm thinking that Tor will try its best to push traffic through one of three static nodes; but this is not what the current behavior does. To do that, we would need to have a concept of "primary guards" (maybe the top 3 guards in the guard list), and Tor would have to make sure that all traffic (including directory traffic) can be pushed through one of those 3 guards.
All these concepts are hairy and weird. Moving to a single entry/directory guard will simplify the logic here.
Another thing that the new data structures should fix is the #12450 (moved) behavior. That is, the part of Tor that figures out that the net was disabled and just came back up, and hence we need to retry our previous guards.
If we just added a new entry guard to our list and we managed to connect to it, then we assume that the network just came back up: we enable the can_retry field of all previous guards and kill the current connection to the new guard. Then we retry guards from the beginning.
The assumption here is that since we had to add a new entry guard to the list we have already exhausted all the already existing entry guards, and hence we had a network down scenario.
This assumption is OK but not complete since it does not consider some edge cases
As an example, it causes the #12450 (moved) bug, where the network comes back up before we exhaust all the already existing entry guards. Since we didn't have to add a new entry guard to the list, the network up event is not detected and we keep on using that entry guard we ended up with.
The new data structures and methods should take this into account.
There are various kludg!^Wways to fix this issue. Some ideas:
a) Everytime we manage to connect to a guard, if it's not the first guard in our list, we mark all previous guards as retriable, and try from the top.
b) Everytime we manage to connect to a guard, if we have previously failed to connect to other guards in this session, we mark all other guards as retriable and try from the top.
c) Everytime we manage to connect to a guard, if we have previously failed to connect to a guard in this session, we mark those previously attempted guards as retriable and try from the top.
All the above ideas will also need some way to stop them from infinite looping.
From the above ideas, I can see some problems with (b) and (c) because its memory is restricted to a single session. Example: Network goes down, Tor tries some guards and marks them as unreachable. Tor gets shut down. Tor starts up again when the network is back up, manages to connect to a guard and stays with that guard (which is not the top one).
(a) is a bit more bulletproof since it will always try to use the top guards, but it also has its problems. First of all, there needs to be a system to stop it from infinite looping, since if your first guard is actually down, we shouldn't keep on retrying it.
Assuming that there is some sort of infinite loop protection, you can imagine the following (unlikely) race condition problem. Tor starts up, network down, Tor tries various guards. Network goes up and we connect to a guard. We notice that network goes up, and we mark all previous guards as retriable. Before starting from the top, network goes down again, Tor finds the first few guards unreachable, and then network goes up again and we connect to the entry guard we were examining at that point (which is not the top one).
Furthermore, it's worth having in mind that such probing behaviors can also act as a guard fingerprint (similar to #10969 (moved)). For example, let's assume the (a) behavior and assume that the first two guards of your entry guard list are actually down (quite common). Everytime you mark your guards as retriable and go from the top, you will also probe those two entry guards that are down. Your ISP will be able to see both of those probes plus the connection to the third guard that actually succeeds. These three connections should be enough to form a fingerprint.
There are various kludg!^Wways to fix this issue. Some ideas:
a) Everytime we manage to connect to a guard, if it's not the first guard in our list, we mark all previous guards as retriable, and try from the top.
b) Everytime we manage to connect to a guard, if we have previously failed to connect to other guards in this session, we mark all other guards as retriable and try from the top.
c) Everytime we manage to connect to a guard, if we have previously failed to connect to a guard in this session, we mark those previously attempted guards as retriable and try from the top.
All the above ideas will also need some way to stop them from infinite looping.
FWIW, the kludges above are needed because we don't have a "Is my network down?" primitive.
And since we don't have such a primitive, we can't distinguish between "Guard is marked unreachable because it was down" and "Guard is marked unrecahble because the network was down". If we could distinguish between those two cases, then this task would be much easier.
Is there a way to build such a primitive in a secure and scaleable fashion? For example, we could imagine that clients ping the authorities to check if their network is up. But this puts more load to the authorities and it doesn't scale well.
Since we want our circuit guards to also be our directory guards (if
possible), we should probably use a single entry guard list (like we
do currently), instead of using separate lists for circuit guards
and directory guards. Reasons for this can be found here:
https://lists.torproject.org/pipermail/tor-dev/2014-May/006824.html
I wonder if this behavior is any different from the corresponding
behavior of circuit guards: "If our circuit guard fails our circuit,
we have to go ahead and ask the next circuit guard"
If not, maybe we could just switch NumDirectoryGuards to 1 too, and
just make sure that if a microdescriptor gets denied we move to the
next directory guard, till we have enough microdescriptor to be happy?
(logic similar to compute_frac_paths_available()).
Watch out there. "compute_frac_paths_available()" still allows some epistemic bias. It only insists that we have a large proportion of possible microdescriptors before we're happy enough to build circuits, not that we have all of them. That's not such a big deal with multiple noncolluding directory guards, but with only one guard, it's possibly trouble.
With the current code, and if NumDirectoryGuards was 3, from the
first entry guard list we would select a directory guard between
(entry2, entry4, entry6). On the second list, we would select
between (entry1, entry4, entry6). On the third list, the worst case
scenario, we would select between (entry4, entry5, ...).
I'd argue that we should strongly prefer the top directory guard
every time, and only move to the lower ones if the top one doesn't
give us what we want.
"Strongly prefer" seems good; can we turn this into an algorithm?
Reading through all of the above stuff, in fact, I think that the right way to approach design here might be to step back from the "what is the right data structure" question and ask ourselves, "what interface must these algorithms expose, and how should they implement it?" IOW, can we enumerate their inputs and outputs, the events that affect them, and the operations they need to perform? If we figure that out, we should be able to examine some ideas in pseudocode and converge on something clever.
Since we want our circuit guards to also be our directory guards (if
possible), we should probably use a single entry guard list (like we
do currently), instead of using separate lists for circuit guards
and directory guards. Reasons for this can be found here:
https://lists.torproject.org/pipermail/tor-dev/2014-May/006824.html
I wonder if this behavior is any different from the corresponding
behavior of circuit guards: "If our circuit guard fails our circuit,
we have to go ahead and ask the next circuit guard"
If not, maybe we could just switch NumDirectoryGuards to 1 too, and
just make sure that if a microdescriptor gets denied we move to the
next directory guard, till we have enough microdescriptor to be happy?
(logic similar to compute_frac_paths_available()).
Watch out there. "compute_frac_paths_available()" still allows some epistemic bias. It only insists that we have a large proportion of possible microdescriptors before we're happy enough to build circuits, not that we have all of them. That's not such a big deal with multiple noncolluding directory guards, but with only one guard, it's possibly trouble.
Yep, we should be careful.
I was thinking of an algorithm like this:
Fetch all mds from first dirguardIf all mds got fetched successfully: Done!else: # If any mds got denied try next dirguard (even if we have enough dir info ) while True: # Keep on asking other dirguards till we have enough dir info Get missing mds from the next dirguard if have_minimum_dir_info(): Done
Which would basically ensure that we have tried at least two dirguards if the first one didn't serve us all the microdescriptors we were looking for. After we have asked two dirguards, we exit when we get enough directory info.
I guess the idea is that the first dirguard might lie, but the probability of both dirguards lying is less. The algorithm can also be adapted to ask the first three dirguards (if we prefer the number 'three' instead of 'two').
I think this kind of algorithm ("Try guards sequentially till you get what you want") is a better approach than "Choose between the 3 top currently active guards in your list." since it avoids bugs like #12466 (moved).
A big engineering problem with this idea, is that the networking logic of Tor is probably not ready to support this feature and will require non-trivial changes. For example, we will need some kind of logic that marks dirguards as skippable if they failed to deliver a microdescriptor, and then the next dirguard query would use the next guard. I can imagine implementation of this feature to get quite complicated.
To better understand whether such an algorithm would work (and how often it would need to ask more dirguards), we need to check how frequent microdescriptor fetch failures are in the wild right now.
...If the above is the only case, we can fix it by making sure thatdirectory servers start serving a consensus _only after_ they havedownloaded the descriptors of all the routers mentioned in thatconsensus....
With the current code, and if NumDirectoryGuards was 3, from the
first entry guard list we would select a directory guard between
(entry2, entry4, entry6). On the second list, we would select
between (entry1, entry4, entry6). On the third list, the worst case
scenario, we would select between (entry4, entry5, ...).
I'd argue that we should strongly prefer the top directory guard
every time, and only move to the lower ones if the top one doesn't
give us what we want.
"Strongly prefer" seems good; can we turn this into an algorithm?
My idea was the following "simple" algorithm:
node_t choose_live_circuit_guard(restrictions) { foreach guard in guard_list: if guard.should_be_used_as_circuit_guard(): return guard # No guard was found return find_and_add_new_guard()}
and
node_t choose_live_directory_guard(restrictions) { foreach guard in guard_list: if guard.should_be_used_as_dir_guard(): return guard # No guard was found return find_and_add_new_guard()}
should_be_used_as_directory_guard() is the same requirements as above plus any restrictions from my previous comment (comment:8) if we decide it's a good idea.
and find_and_add_new_guard() would basically be like tor's add_an_entry_guard() with chosen set to NULL.
The idea is to basically use the guard list as a conveyor belt, and if no items in the conveyor belt can satisfy us append a new one.
Reading through all of the above stuff, in fact, I think that the right way to approach design here might be to step back from the "what is the right data structure" question and ask ourselves, "what interface must these algorithms expose, and how should they implement it?" IOW, can we enumerate their inputs and outputs, the events that affect them, and the operations they need to perform? If we figure that out, we should be able to examine some ideas in pseudocode and converge on something clever.
I agree that this will be helpful.
When I started thinking about this problem, I wanted to formalize it
but I quickly found myself unsure of how much formalization is
actually productive here.
I found it helpful to think of this as an OOP system. I thought of a
class called Guard and a class called GuardList:
A Guard is basically a Tor node plus some metadata about it. Using
the metadata you can answer questions like "is it fast?", "is it
dir?", "was it unreachable?", "should we retry?", "is it too old?".
A GuardList contains many Guard objects and knows how to select
between them. It basically acts as a conveyor belt for them.
The main interface to the guard subsystem that Tor needs to use are
the functions that return entry guards and directory guards. In the
current codebase, this is choose_random_entry_impl().
We can imagine this function as a method of the GuardList that does
something similar to what was described in comment:9: it takes as
input some restrictions (the type of the circuit, the exit node
family, whether bridges are used etc.) and outputs a node that can be
used (or NULL if nothing could be found). The method has side-effects,
since the guard list might need to be extended to find a suitable
node.
Some more methods of the guard list could be:
load_guards(), which loads guards from the torrc and state file.
save_guards(), which saves guards to the state file.
refresh(), which kills old/bad guards based on a newly arrived
consensus (see entry_guards_compute_status())
network_is_back_up(), which is the "network up" trigger (see comment:5)
stuff like add_guard(), remove_guard(), get_num_guards(), etc.
Continuing with comment:9, the main methods of the Guard class would
be:
bool should_be_used_as_dir_guard(restrictions) and
bool should_be_used_as_circuit_guard(restrictions)
These methods take a look at the latest consensus, the metadata of
each guard and the restrictions imposed, to figure out whether a Guard
can be used as a guard right now. These methods will be used by the
GuardList when choosing a guard node.
Now, here are various events that need to be taken into account:
The network is down! (comment:5)
Directory guard does not have descriptor X (comment:8)
Top circuit guard is not a directory.
Guard used to be OK, but now: is not a guard anymore
is not a directory
is path bias disabled
is in the same family as exit
is too old
is an excluded node
Any suggestions on how I should be looking at this better?