Here's our plan:
Directory authorities need to track how much of the past n months each relay was around and had the Guard flag.
They vote a percentage for each relay in their vote, and the consensus has a new keyword on the w line so clients can learn how Guardy each relay has been.
Clients change their load balancing algorithm to consider how Guardy you've been, rather than just treating Guard status as binary (#8453).
Raise the guard rotation period a lot (#8240).
Take into account the available bandwidth on a relay: advertised bw - actual bw used. A slow guard is really bad for the user experience.
I don't think we have quick enough feedback to make this work right. Instead, I think something like conflux's "adapt which circuit you use based on round-trip times" is going to serve us better for this one.
Exit nodes should not be used as guards too often; it wastes their bandwidth.
You need to know the GB/day for each relay; there is not much day-to-day variance, and that is enough for load balancing. Relays without a fixed bandwidth limit can be a bit problematic: their GB/day depends on the weight they are assigned much more than it does for relays with a fixed limit.
You need to watch for trends in young relays (< 1 month old). They are continually increasing their traffic, and the rate of increase seems to be roughly constant.
If you plan to have just 2 guards and a long rotation period, then it's important to prefer guards with spare bandwidth.
FWIW, here is our plan from proposal 236 wrt this ticket:
A guard N that has been visible for V out of NNN*30*24 consensuses has had the opportunity to be chosen as a guard by approximately F = V/(NNN*30*24) of the clients in the network, and the remaining 1-F fraction of the clients have not noticed this change. So when being chosen for middle or exit positions on a circuit, clients should treat N as if F fraction of its bandwidth is a guard (respectively, dual) node and (1-F) is a middle (resp, exit) node.

Let Wpf denote the weight from the 'bandwidth-weights' line a client would apply to N for position p if it had the guard flag, Wpn the weight if it did not have the guard flag, and B the measured bandwidth of N in the consensus. Then instead of choosing N for position p proportionally to Wpf*B or Wpn*B, clients should choose N proportionally to F*Wpf*B + (1-F)*Wpn*B.

Similarly, when calculating the bandwidth-weights line as in section 3.8.3 of dir-spec.txt, directory authorities should treat N as if fraction F of its bandwidth has the guard flag and (1-F) does not. So when computing the totals G, M, E, D, each relay N with guard visibility fraction F and bandwidth B should be added as follows:

G' = G + F*B, if N does not have the exit flag
M' = M + (1-F)*B, if N does not have the exit flag
D' = D + F*B, if N has the exit flag
E' = E + (1-F)*B, if N has the exit flag
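As a quick illustration of that weighting (a sketch only, with made-up helper names, not little-t-tor code):
{{{
# Sketch of the prop236 weighting described above.
# wpf: bandwidth-weight for position p if the relay had the Guard flag
# wpn: bandwidth-weight for position p if it did not
# bw:  measured bandwidth B from the consensus
# visible, total: consensuses the relay had the Guard flag in, out of
#                 the total number of consensuses considered
def position_weight(wpf, wpn, bw, visible, total):
    f = visible / float(total)               # guard visibility fraction F
    return f * wpf * bw + (1 - f) * wpn * bw

# Authority-side totals: count fraction F of the relay's bandwidth as
# guard (or guard+exit) bandwidth, and (1 - F) as middle (or exit).
def add_to_totals(totals, bw, visible, total, has_exit_flag):
    f = visible / float(total)
    if has_exit_flag:
        totals['D'] += f * bw
        totals['E'] += (1 - f) * bw
    else:
        totals['G'] += f * bw
        totals['M'] += (1 - f) * bw
}}}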
I've made a bit of progress on this. I have a stupid Python script
that you can point at a directory with consensus documents. It will
parse them all (using stem), and for each guard it will spit out the
number of consensuses it was mentioned in, as well as the earliest
and the latest consensus it appeared in.
The idea is that this script will have to be finished, and then
somehow executed on the authorities. The script will spit out an
output file that little-t-tor can parse in the same fashion as the
bandwidth authority files (see measured_bw_line_parse()).
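For reference, a minimal sketch of what such a script might look like (this is not the actual script; the counting and stem usage here are just one plausible way to do it):
{{{
import collections
import os
import sys

from stem.descriptor import parse_file

# Count, for each relay, how many consensuses listed it with the Guard
# flag. Assumes a directory of consensus files of the CollecTor/metrics
# "network-status-consensus-3 1.0" flavor.
counts = collections.Counter()
first_seen, last_seen = {}, {}
n_consensuses = 0

for name in sorted(os.listdir(sys.argv[1])):
    n_consensuses += 1
    path = os.path.join(sys.argv[1], name)
    for router in parse_file(path, 'network-status-consensus-3 1.0'):
        if 'Guard' in router.flags:
            counts[router.fingerprint] += 1
            first_seen.setdefault(router.fingerprint, name)
            last_seen[router.fingerprint] = name

for fpr, seen in counts.most_common():
    print('%s %d/%d first: %s last: %s' %
          (fpr, seen, n_consensuses, first_seen[fpr], last_seen[fpr]))
}}}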
A few questions that will need to be answered:
Will the script be called periodically, with the authorities having
to parse the output file every once in a while? Or will the script
be run once, and then it's the job of the authorities to internally
update their state with new information?
I'm currently aiming for the former behavior, to minimize the amount
of code that needs to be written for little-t-tor. OTOH, this means
that authorities will need to keep 9 months' worth of consensuses in
their filesystem. As we move closer to completion of this task we
will see whether the former behavior is indeed better.
Where will the consensuses be fetched from? To run this script we
need to have a directory filled with consensuses. How are we going
to get those documents? rsync cronjob from metrics? Does this scale?
What else can we do?
What should we do about consensus signatures? If we are fetching
consensuses from metrics, it's reasonable that we don't want to
trust them blindly. It would be nice if the script (or the auth)
could validate the consensus signatures, but it's not an easy task:
How will the script get the public keys of the auths? What if the
set of auths changes? What if (as part of an attack) we are given a
consensus with only one or two auth signatures? Should it be
accepted even though it's signed by a minority of auths? Should our
stupid script understand all these consensus security details?
What should happen if we are missing a few consensuses? Sometimes
the auths fail to establish a consensus, so it's reasonable that a
few consensuses will be missing if we look 9 months back. Should our
Python script recognize their absence? What if half of the
consensuses are missing? What if just one is missing? How should our
script react?
We need to think about how the guardiness information will be passed to
clients (since clients need to change their path selection procedure
according to the guardiness). Proposal 236 simply says:
{{{
The authorities include the age of each guard by appending
'[SP "GV=" INT]' in the guard's "w" line.
}}}
But I don't think that's enough. What is an age anyway?
Should we pass to clients information like "Node was a guard for
3031/6574 consensuses during the past 9 months"? Should this be
passed in the consensus as part of the 'w' line or something?
How should this be deployed?
In the beginning it should parse 2-3 months' worth of consensuses (the
current guard lifetime period), and then as more clients upgrade we
should make it parse 9-10 months' (or whatever we decide) worth of
consensuses?
I'm planning to tackle these questions soon. Any feedback is helpful!
Also, any ideas on the format of the file that the script should output?
What would be the easiest format for little-t-tor to parse these days?
I'm looking at dirserv_read_measured_bandwidths(), but the file format of the bw auths looks a bit arbitrary. It's easy enough to parse, but it's not something that can be reused. Can I reuse an existing little-t-tor file parser, or should I invent yet another file format?
The information that needs to be transmitted could be something like
this (maybe I'm forgetting some info, or adding redundant info):
{{{
<date and time> <number of consensuses parsed> <number of months considered>
<guard fpr 1> <number of times seen in a consensus>
<guard fpr 2> <number of times seen in a consensus>
<guard fpr 3> <number of times seen in a consensus>
<guard fpr 4> <number of times seen in a consensus>
<guard fpr 5> <number of times seen in a consensus>
...
}}}
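To make that (still hypothetical) format concrete, the end of the script could be something like:
{{{
import time

# Hypothetical writer for the output format sketched above; the header
# fields and layout are not something tor parses today.
def write_guardiness_file(path, counts, n_consensuses, n_months):
    with open(path, 'w') as out:
        out.write('%s %d %d\n' % (time.strftime('%Y-%m-%d %H:%M:%S'),
                                  n_consensuses, n_months))
        for fpr, seen in sorted(counts.items()):
            out.write('%s %d\n' % (fpr, seen))
}}}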
I also attached some initial results (the results file) just to get an idea of what's going on.
It's 3 months' worth of consensuses, and the computation took around 15 minutes on a decent box.
Will the script be called periodically, with the authorities having
to parse the output file every once in a while? Or will the script
be run once, and then it's the job of the authorities to internally
update their state with new information?
I'm currently aiming for the former behavior, to minimize the amount
of code that needs to be written for little-t-tor. OTOH, this means
that authorities will need to keep 9 months' worth of consensuses in
their filesystem. As we move closer to completion of this task we
will see whether the former behavior is indeed better.
FWIW, I agree this is probably the right design, though parsing 9 months' worth of consensuses with stem is no mean feat. An alternative would be to have the script keep a summary file that is updated as new consensuses are fetched; it might store, e.g., the number of consensuses a relay appeared in on each day, and could then be batch-updated.
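A rough sketch of that summary-file idea (entirely hypothetical, just to make the shape concrete):
{{{
import json

# Per-day summary of guard visibility:
#   {"YYYY-MM-DD": {"consensuses": N, "guards": {fingerprint: count}}}
# Each newly fetched consensus is folded in, so old consensuses never
# need to be re-parsed.
def update_summary(summary_path, day, guard_fingerprints):
    try:
        with open(summary_path) as f:
            summary = json.load(f)
    except (IOError, ValueError):
        summary = {}
    entry = summary.setdefault(day, {'consensuses': 0, 'guards': {}})
    entry['consensuses'] += 1
    for fpr in guard_fingerprints:
        entry['guards'][fpr] = entry['guards'].get(fpr, 0) + 1
    with open(summary_path, 'w') as f:
        json.dump(summary, f)
}}}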
What should we do about consensus signatures? If we are fetching
consensuses from metrics, it's reasonable that we don't want to
trust them blindly. It would be nice if the script (or the auth)
could validate the consensus signatures, but it's not an easy task:
I recently submitted a patch to stem (#11045) that can check consensus signatures and certificates, so that part should be simple. Getting the certificates should also be easy -- every directory server serves the list of current certs at http://<address>:<dirport>/tor/keys/all.z, and every tor client stores the list in .tor/cached-certs. stem will happily parse these with parse_file(), though you might need to give it a hint about the file type. Getting historical keys is more problematic; the script probably needs to retain a cache of old certs. (Incidentally, metrics serves this; it's not very big.)
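For example, loading the certificates with stem might look roughly like this (the file-type hint is the CollecTor annotation for key certificates; whether the consensus files need a similar hint depends on where they came from):
{{{
from stem.descriptor import parse_file

# Load the directory authority key certificates from a client's
# cached-certs file (or from a decompressed copy of /tor/keys/all.z),
# giving parse_file() an explicit hint about the file type.
key_certs = list(parse_file('cached-certs', 'dir-key-certificate-3 1.0'))

# With the certificates in hand, the consensus signature check added in
# stem #11045 can be run against each parsed consensus document before
# its Guard flags are counted.
}}}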
What should happen if we are missing a few consensuses? Sometimes
the auths fail to establish a consensus, so it's reasonable that a
few consensuses will be missing if we look 9 months back. Should our
Python script recognize their absence? What if half of the
consensuses are missing? What if just one is missing? How should our
script react?
These are good questions. At the winter dev meeting, I think Nick suggested that when the auths fail to establish a consensus, there should be a signed "consensus failed" message. I guess this requires another proposal, but if we had it going forward, then there shouldn't be a missing consensus.
We need to think about how the guardiness information will be passed to
clients (since clients need to change their path selection procedure
according to the guardiness). Proposal 236 simply says:
{{{
The authorities include the age of each guard by appending
'[SP "GV=" INT]' in the guard's "w" line.
}}}
But I don't think that's enough. What is an age anyway?
Should we pass to clients information like "Node was a guard for
3031/6574 consensuses during the past 9 months"? Should this be
passed in the consensus as part of the 'w' line or something?
Agreed that the number of consensuses counted for GV should appear in the consensus; it could be in the w line or could also appear once in the document, e.g. before 'bandwidth-weights' there could be a line of the form
{{{
"guard-visibility-max" INT NL
And this would help with the deployment also; we could start with 3 months' worth of consensuses and just add consensuses as they are produced.
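Purely for illustration (the keyword names follow proposal 236 and the suggestion above; the numbers are made up), a consensus built from about 3 months' worth of consensuses (roughly 2160 of them) might then contain:
{{{
guard-visibility-max 2160
...
w Bandwidth=5120 GV=1743
}}}
and a client would compute F = 1743/2160 for that relay and weight it as described in the proposal 236 excerpt above.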