Opened 20 months ago

Closed 5 weeks ago

Last modified 5 weeks ago

#9321 closed project (implemented)

Load balance right when we have higher guard rotation periods

Reported by: arma
Owned by:
Priority: major
Milestone: Tor: 0.2.6.x-final
Component: Tor
Version:
Keywords: needs-proposal, tor-auth, tor-client, 026-triaged-1, unfrozen, nickm-review
Cc: amj703, NickHopper, isis
Actual Points:
Parent ID: #11480
Points:

Description (last modified by arma)

Here's our plan:

1) Directory authorities need to track how much of the past n months each relay was around and had the Guard flag.
2) They vote a percentage for each relay in their vote, and the consensus has a new keyword on the w line so clients can learn how Guardy each relay has been.
3) Clients change their load balancing algorithm to consider how Guardy you've been, rather than just treating Guard status as binary (#8453).
4) Raise the guard rotation period a lot (#8240).

Child Tickets

Attachments (1)

results (435.7 KB) - added by asn 8 months ago.
some initial results


Change History (56)

comment:1 Changed 20 months ago by arma

  • Keywords needs-proposal added

comment:2 Changed 20 months ago by arma

  • Keywords tor-auth tor-client added

comment:3 Changed 20 months ago by arma

  • Description modified (diff)

comment:4 Changed 20 months ago by mikeperry

  • Parent ID set to #8453

comment:5 follow-up: Changed 20 months ago by hsn

Take into account the available bandwidth on a relay: advertised bw minus actual bw used. A slow guard is really bad for the user experience.

Exit nodes should not be used as guards too often; it wastes their bandwidth.

comment:6 in reply to: ↑ 5 Changed 20 months ago by arma

Replying to hsn:

Take into account the available bandwidth on a relay: advertised bw minus actual bw used. A slow guard is really bad for the user experience.

I don't think we have quick enough feedback to make this work right. Instead, I think something like conflux's "adapt which circuit you use based on round-trip times" is going to serve us better for this one.

Exit nodes should not be used as guards too often; it wastes their bandwidth.

I agree.

comment:7 Changed 20 months ago by hsn

You need to know a relay's GB/day; there is not much day-to-day variance. That is enough for load balancing. Relays without a fixed bandwidth limit can be a bit problematic: their GB/day depends on the assigned weight far more than that of relays with a fixed limit.

You need to watch for trends in young relays (< 1 month). They continually increase their traffic, and the rate of increase seems to be constant.

If you plan to have just 2 guards and a long cycle time, then it's important to prefer guards with spare bandwidth.

comment:8 Changed 16 months ago by amj703

  • Cc amj703 added

comment:9 Changed 13 months ago by nickm

  • Milestone changed from Tor: 0.2.5.x-final to Tor: 0.2.6.x-final

comment:10 Changed 9 months ago by nickm

  • Keywords 026 added
  • Parent ID changed from #8453 to #11480

comment:11 Changed 9 months ago by nickm

  • Keywords 026-triaged-1 added; 026 removed

comment:12 Changed 8 months ago by asn

FWIW, here is our plan from proposal 236 wrt this ticket:

   A guard N that has been visible for V out of NNN*30*24 consensuses
   has had the opportunity to be chosen as a guard by approximately
   F = V/(NNN*30*24) of the clients in the network, and the remaining
   1-F fraction of the clients have not noticed this change.  So when
   being chosen for middle or exit positions on a circuit, clients
   should treat N as if F fraction of its bandwidth is a guard
   (respectively, dual) node and (1-F) is a middle (resp, exit) node.
   Let Wpf denote the weight from the 'bandwidth-weights' line a
   client would apply to N for position p if it had the guard
   flag, Wpn the weight if it did not have the guard flag, and B the
   measured bandwidth of N in the consensus.  Then instead of choosing
   N for position p proportionally to Wpf*B or Wpn*B, clients should
   choose N proportionally to F*Wpf*B + (1-F)*Wpn*B.

   Similarly, when calculating the bandwidth-weights line as in
   section 3.8.3 of dir-spec.txt, directory authorities should treat N
   as if fraction F of its bandwidth has the guard flag and (1-F) does
   not.  So when computing the totals G,M,E,D, each relay N with guard
   visibility fraction F and bandwidth B should be added as follows:

   G' = G + F*B, if N does not have the exit flag
   M' = M + (1-F)*B, if N does not have the exit flag
   D' = D + F*B, if N has the exit flag
   E' = E + (1-F)*B, if N has the exit flag
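
In Python terms, the two rules above amount to the following (a minimal sketch; the names mirror the proposal's symbols, and nothing here comes from the tor source):

def position_weight(F, B, Wpf, Wpn):
    # Weight for choosing relay N for position p: F*Wpf*B + (1-F)*Wpn*B.
    return F * Wpf * B + (1.0 - F) * Wpn * B

def update_totals(G, M, E, D, F, B, has_exit_flag):
    # Add relay N to the authority totals as if fraction F of its
    # bandwidth has the Guard flag and (1-F) does not.
    if has_exit_flag:
        return G, M, E + (1.0 - F) * B, D + F * B
    return G + F * B, M + (1.0 - F) * B, E, D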

comment:13 follow-up: Changed 8 months ago by asn

I've made a bit of progress on this. I have a stupid Python script
that you can point at a directory of consensus documents. It will
parse them all (using stem), and for each guard it will spit out the
number of consensuses it was mentioned in, as well as the earliest
and latest consensus it appeared in.

The idea is that this script will have to be finished, and then
somehow executed on the authorities. The script will spit out an
output file that can be parsed by little-t-tor in the same fashion as
the bandwidth authorities do (see measured_bw_line_parse()).

Few questions that will need to be answered:

  • Will the script be called periodically and the authorities will have to parse the output file every once in a while? Or will the script be run once, and then it's the job of the authorities to internally update their state with new information?

I'm currently aiming for the former behavior, to minimize the amount
of code that needs to be written for little-t-tor. OTOH, this means
that authorities will need to keep 9 months worth of consensuses in
their filesystem. As we move closer to completion of this task we
will see if the former behavior is indeed better.

  • Where will the consensuses be fetched from? To run this script we need to have a directory filled with consensuses. How are we going to get those documents? rsync cronjob from metrics? Does this scale? What else can we do?
  • What should we do about consensus signatures? If we are fetching consensuses from metrics, it's reasonable that we don't want to trust them blindly. It would be nice if the script (or the auth) could validate the consensus signatures, but it's not an easy task:

How will the script get the public keys of the auths? What if the
auth set changes? What if (as part of an attack) we are given a
consensus with only one or two auth signatures? Should it be
accepted even though it's signed by a minority of auths? Should our
stupid script understand all these consensus security details?

  • What should happen if we are missing a few consensuses? Sometimes the auths fail to establish a consensus, so it's reasonable that a few consensuses will be missing if we look 9 months back. Should our Python script recognize their absence? What if half of the consensuses are missing? What if just one is missing? How should our script react?
  • We need to think about how the guardiness information will be passed to clients (since clients need to change their path selection procedure according to the guardiness). Proposal 236 simply says:
     The authorities include the age of each guard by appending
     '[SP "GV=" INT]' in the guard's "w" line.
    
    But I don't think that's enough. What is an age anyway?

Should we pass to clients information like "Node was a guard for
3031/6574 consensuses during the past 9 months"? Should this be
passed in the consensus as part of the 'w' line or something?

  • How should this be deployed?

In the beginning it should parse 2-3 months worth of consensuses (the
current guard lifetime period), and then as more clients upgrade we
should make it parse 9-10 months (or whatever we decide) worth of
consensuses?

I'm planning to tackle these questions soon. Any feedback is helpful!

comment:14 follow-up: Changed 8 months ago by asn

Also, any ideas on the format of the file that the script should output?
What would be the easiest format for little-t-tor to parse these days?

I'm looking at dirserv_read_measured_bandwidths() but the file format of the bw auths looks a bit arbitrary. It's pretty easy to parse but it's not something that can be reused. Can I reuse a file parser of little-t-tor, or should I write yet another file format?

The information that needs to be transmitted could be something like
this (maybe I'm forgetting some info, or adding redundant info):

<date and time>
<number of consensuses parsed> <number of months considered>

<guard fpr 1> <number of times seen in a consensus>
<guard fpr 2> <number of times seen in a consensus>
<guard fpr 3> <number of times seen in a consensus>
<guard fpr 4> <number of times seen in a consensus>
<guard fpr 5> <number of times seen in a consensus>
...

Changed 8 months ago by asn

some initial results

comment:15 Changed 8 months ago by asn

I also attached some initial results in attachment:results just to get an idea of what's going on.
It's 3 months worth of consensuses, and the computation took around 15 minutes on a decent box.

comment:16 in reply to: ↑ 13 ; follow-up: Changed 8 months ago by NickHopper

Replying to asn:

Few questions that will need to be answered:

  • Will the script be called periodically and the authorities will have

to parse the output file every once in a while? Or will the script
be run once, and then it's the job of the authorities to internally
update their state with new information?

I'm currently aiming for the former behavior, to minimize the amount
of code that needs to be written for little-t-tor. OTOH, this means
that authorities will need to keep 9 months worth of consensuses in
their filesystem. As we move closer to completion of this task we
will see if the former behavior is indeed better.

FWIW, I agree this is probably the right design, though parsing 9 months worth of consensuses with stem is no mean feat. An alternative would be to have the script keep a summary file that is updated as new consensuses are fetched; it might store, e.g. the number of consensuses a relay appeared in for each day, and then could get batch updated.

  • What should we do about consensus signatures? If we are fetching

consensuses from metrics, it's reasonable that we don't want to
trust them blindly. It would be nice if the script (or the auth)
could validate the consensus signatures, but it's not an easy task:

I recently submitted a patch to stem (#11045) that can check consensus signatures, and certificates, so that part should be simple. Getting the certificates should also be easy -- every directory server serves the list of current certs at http://<hostname>:<dirport>/tor/keys/all.z and every tor client stores the list in .tor/cached-certs . stem will happily parse these with parse_file(), though you might need to give it a hint about the file type. Getting historical keys is more problematic; the script probably needs to retain a cache of old certs. (Incidentally, metrics serves this, it's not very big.)

  • What should happen if we are missing a few consensuses? Sometimes

the auths fail to establish a consensus, so it's reasonable that a
few consensuses will be missing if we look 9 months back. Should our
Python script recognize their absence? What if half of the
consensuses are missing? What if just one is missing? How should our
script react?

These are good questions. At the winter dev meeting, I think Nick suggested that when the auths fail to establish a consensus, there should be a signed "consensus failed" message. I guess this requires another proposal, but if we had it going forward, then there shouldn't be a missing consensus.

  • We need to think about how the guardiness information will be passed to

clients (since clients need to change their path selection procedure
according to the guardiness). Proposal 236 simply says:

The authorities include the age of each guard by appending
'[SP "GV=" INT]' in the guard's "w" line.

But I don't think that's enough. What is an age anyway?

Should we pass to clients information like "Node was a guard for
3031/6574 consensuses during the past 9 months"? Should this be
passed in the consensus as part of the 'w' line or something?

Agreed that the number of consensuses counted for GV should appear in the consensus; it could be in the w line or could also appear once in the document, e.g. before 'bandwidth-weights' there could be a line of the form

"guard-visibility-max" INT NL

And this would help with the deployment also; we could start with 3 months' worth of consensuses and just add consensuses as they are produced.

comment:17 Changed 8 months ago by NickHopper

  • Cc NickHopper added

comment:18 in reply to: ↑ 14 Changed 8 months ago by nickm

Replying to asn:

Also, any ideas on the format of the file that the script should output?
What would be the easiest format for little-t-tor to parse these days?

I'm looking at dirserv_read_measured_bandwidths() but the file format of the bw auths looks a bit arbitrary. It's pretty easy to parse but it's not something that can be reused. Can I reuse a file parser of little-t-tor, or should I write yet another file format?

The information that needs to be transmitted could be something like
this (maybe I'm forgetting some info, or adding redundant info):

<date and time>
<number of consensuses parsed> <number of months considered>

<guard fpr 1> <number of times seen in a consensus>
<guard fpr 2> <number of times seen in a consensus>
<guard fpr 3> <number of times seen in a consensus>
<guard fpr 4> <number of times seen in a consensus>
<guard fpr 5> <number of times seen in a consensus>
...

Generally, a file full of "key value\n" lines is easiest to parse with Tor right now. So let me suggest:

written-at <date and time>
n-inputs <number of consensuses parsed> <number of months considered>

guard-seen <guard fpr 1> <number of times seen in a consensus> ...
guard-seen <guard fpr 2> <number of times seen in a consensus> ...
guard-seen <guard fpr 3> <number of times seen in a consensus> ...
guard-seen <guard fpr 4> <number of times seen in a consensus> ...
guard-seen <guard fpr 5> <number of times seen in a consensus> ...
...
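
For illustration, here is a hypothetical Python emitter for that format; the exact field layout is an assumption based on this comment, not a settled spec:

from datetime import datetime, timezone

def write_guardiness_file(path, n_consensuses, n_months, guard_counts):
    # guard_counts maps a guard fingerprint to the number of consensuses
    # it was seen in during the considered period.
    with open(path, "w") as f:
        now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        f.write("written-at %s\n" % now)
        f.write("n-inputs %d %d\n" % (n_consensuses, n_months))
        for fpr, seen in sorted(guard_counts.items()):
            f.write("guard-seen %s %d\n" % (fpr, seen))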

comment:19 in reply to: ↑ 16 ; follow-up: Changed 8 months ago by asn

Replying to NickHopper:

Replying to asn:

Few questions that will need to be answered:

  • Will the script be called periodically and the authorities will have

to parse the output file every once in a while? Or will the script
be run once, and then it's the job of the authorities to internally
update their state with new information?

I'm currently aiming for the former behavior, to minimize the amount
of code that needs to be written for little-t-tor. OTOH, this means
that authorities will need to keep 9 months worth of consensuses in
their filesystem. As we move closer to completion of this task we
will see if the former behavior is indeed better.

FWIW, I agree this is probably the right design, though parsing 9 months worth of consensuses with stem is no mean feat. An alternative would be to have the script keep a summary file that is updated as new consensuses are fetched; it might store, e.g. the number of consensuses a relay appeared in for each day, and then could get batch updated.

Some thoughts on how the script should be run by Tor.

*Ideally*, the script should be run every hour: every time the Tor
authorities are making a vote. This means that the script should
be *quick*: parsing 9 months worth of consensuses with stem takes
about 1.5 hours on a decent box, which makes it impossible to run
every hour.

To work around this, we might try to use some sort of "summary file"
that our script can keep updated with compact guard info about the
past months, so that it doesn't need to parse all those consensuses
every time.

Unfortunately, summary files are not trivial in our case:

a)

We are considering 9 *rolling* months of consensuses and we run the
script for every new consensus; this means that in every run we
need to subtract the data of the oldest consensus (the one from 9
months ago) since it has expired (see the sketch at the end of this
comment).

Hence the "summary file" can't be a simple summary of the past 9
months, because then we wouldn't know the exact values of the
oldest consensus (which we need to subtract from our summary since
newer observations took their place).

b)

Also, for every consensus (from the past 9 months), we need to keep
track of _all_ guards, not only the ones that were referenced in
recent consensuses. The reason is that a guard might be taking a
break for 6 months and then suddenly reappear in the consensus. If a
guard's identity key has made an appearance in the past 9 months, we
should have its data.

With that in mind, here are some ideas on summary files:

  • We keep a summary file for every consensus.

The idea here is that our summary files will be much faster to parse
than full fledged consensuses. Ideally, the parsing time would be
reduced from 1.5 hours to a few seconds/minutes, but I'm not sure if
that's realistic.

To get an idea of the data size, for a 9 months period, we are
talking about 6480 summary files (consensuses) and about 2500 guards
per summary file.

This seems like the easiest scheme to implement and understand,
but I'm not sure if it will be efficient enough.

  • We try to be smarter with summary files.

For example, we could keep an 8-month summary file, and for the
oldest month (the one 9 months ago) we actually parse the
consensuses one-by-one, which allows us to know which consensus
to ignore in every run.

This idea involves much harder engineering work: for example, at
the beginning of each month, we will need to use a directory with
consensuses for the new oldest month (since we no longer use a
summary file for it).

This plan is doable but hefty engineering work, and it might give
us the results we want fast: an 8-month summary file should be easy
to parse, and 1 month of consensuses should take about 10 minutes.

We can imagine optimizations where we keep weekly summary files for
the oldest month, etc.

I also wonder how much of a speed up (or is that only space
efficiency) we could gain by using some membership data structures
like bloom filters.

Also, the above ideas assume that we need to run the script (get
updated guardiness data) for every new vote, which is the best
behavior. If we relax this requirement and e.g. only run the script
every week, we could potentially just suck it up and parse the
consensuses manually even if it takes 2 hours.

comment:20 in reply to: ↑ 19 Changed 8 months ago by asn

Replying to asn:

  • We keep a summary file for every consensus.

The idea here is that our summary files will be much faster to parse
than full fledged consensuses. Ideally, the parsing time would be
reduced from 1.5 hours to a few seconds/minutes, but I'm not sure if
that's realistic.

To get an idea of the data size, for a 9 months period, we are
talking about 6480 summary files (consensuses) and about 2500 guards
per summary file.

This seems like the easiest scheme to implement and understand,
but I'm not sure if it will be efficient enough.

I experimented a bit with this idea to see what the speed improvement would be.

FWIW, my initial summary files map 1-1 to consensuses, and they look like this:

<summary file header>
<consensus valid-after time>
<guard fpr 1>
<guard fpr 2>
...

this is trivial to parse in Python.
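
"Trivial to parse" might look something like this, assuming exactly the three-part layout above (the real script's format may differ):

def parse_summary_file(path):
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    valid_after = lines[1]   # lines[0] is the summary file header
    guard_fprs = lines[2:]   # one guard fingerprint per line
    return valid_after, guard_fprs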

So, for the measurements, I parsed 10 consensuses. When parsed normally, the time was:

real	0m13.193s
user	0m12.960s
sys	0m0.072s

When I preprocessed them to create summary files for each consensus, and then parsed the summary files, the time was:

real	0m0.776s
user	0m0.708s
sys	0m0.064s

This is a 17x improvement in time. This means that 9 months of consensuses that would originally take 2 hours to parse will take about 7 minutes when summary files are used.

UPDATE:
I also parsed 3 months of consensuses using summary files. Parsing 3 months of consensuses with stem takes about 35 minutes. When using summary files it took:

real	0m6.394s
user	0m6.188s
sys	0m0.208s

The preprocessing stage, which used stem, took about 32 mins for 1968 consensuses.

Last edited 8 months ago by asn

comment:21 Changed 8 months ago by asn

FWIW, my main fear with summary files is the extra file/directory management that is involved. For example, how to delete the old summary files, what to do with the preprocessed consensuses, etc.

And wrt file sizes, 3 months of microdescriptor consensuses are about 2.0GB. 3 months of summary files are about 170MB.

This is probably linear, so 9 months of md consensuses will be about 6GB, and the summary files will be about 0.5GB.

Last edited 8 months ago by asn

comment:22 follow-up: Changed 8 months ago by asn

OK, I have some ideas on how this might get deployed to directory authorities.

I now have two scripts: One script (summizer.py) summarizes consensuses into compact summary files, and the next script (guardiness.py) calculates the guardiness of nodes given a bunch of summary files. In the end, guardiness.py outputs a file with the guardiness information that is to be read by little-t-tor before forming a vote.

When a directory authority operator wants to start using the guardiness system (bootstrap), she makes a consensus directory and downloads 3 months worth of consensuses into it (probably from metrics). Then she runs the summizer.py script, which reads the consensuses from that directory and saves a summary file for each consensus in a separate summary file directory. This step is expected to take 30 minutes or so.

Then the dirauth operator will add a cron job, similar to the one for bw auths, that triggers at the 35th minute of each hour (or something). This cron job does the following:

  • Download the latest consensus from metrics.
  • Run summizer.py to summarize the new consensus (this should be instant since it's just one consensus).
  • Run guardiness.py to calculate guardiness over the summary files from the past 3 months in the summary directory (this will take between a few seconds and 2 minutes of intense computation).

Then 5 minutes before the hour, little-t-tor will read the guardiness output file and consider its data when voting.

Some notes:

  • Instead of downloading the latest consensus from metrics, maybe we could add little-t-tor code that would save new consensuses into a directory. This way, we don't need to bulk-download from metrics, apart from during the bootstrap procedure.
  • To avoid keeping old summary/consensus files around that take up disk space, I added some optional cleanup switches to the scripts. Specifically, the summizer.py script can delete the consensus files that got summarized and can also delete consensus files older than 3 months (or N months). Similarly, the guardiness.py script can delete summary files older than 3 months (or N months). The idea is that every time the cron job triggers, the summizer.py and guardiness.py scripts will auto-delete the oldest summary/consensus files, keeping only the useful files on disk.

comment:23 in reply to: ↑ 22 Changed 8 months ago by nickm

Replying to asn:
[...]

Then 5 minutes before the hour, little-t-tor will read the guardiness output file and consider its data when voting.

This all seems plausible; can any dirauth ops comment on whether this is a reasonable thing to set up?

Some notes:

  • Instead of downloading the latest consensus from metrics, maybe we could add little-t-tor code that would save new consensuses into a directory. This way, we don't need to bulk-download from metrics, apart from during the bootstrap procedure.

This seems plausible and not terribly hard. The only thing to deal with here would be getting rid of old files. (see below.)

I guess you could have a "KeepOldConsensuses 90 days" option along with "OldConsensusDir /var/xyzzy" that would store the last 90 days of consensuses in some directory of your choice.

  • To avoid keeping old summary/consensus files around that take up disk space, I added some optional cleanup switches to the scripts. Specifically, the summizer.py script can delete the consensus files that got summarized and can also delete consensus files older than 3 months (or N months). Similarly, the guardiness.py script can delete summary files older than 3 months (or N months). The idea is that every time the cron job triggers, the summizer.py and guardiness.py scripts will auto-delete the oldest summary/consensus files, keeping only the useful files on disk.

This also seems plausible. Except instead of deleting the oldest unconditionally, it's probably best for them to delete any consensus whose valid-until time is more than N days before the current time. That way, if you have to re-run them, or if you mess up mtimes on your disk, they don't wind up deleting things they shouldn't.
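
In Python, that deletion rule might look like the sketch below. It assumes the summary format from comment 20, where the file's second line is the consensus timestamp; the exact timestamp format is an assumption.

import glob
import os
from datetime import datetime, timedelta

def cleanup_expired_summaries(summary_dir, keep_days=90):
    # Delete summary files whose consensus timestamp is more than keep_days
    # old, rather than unconditionally deleting the oldest files.
    cutoff = datetime.utcnow() - timedelta(days=keep_days)
    for path in glob.glob(os.path.join(summary_dir, "*")):
        with open(path) as f:
            f.readline()  # skip the summary file header
            stamp = datetime.strptime(f.readline().strip(),
                                      "%Y-%m-%d %H:%M:%S")
        if stamp < cutoff:
            os.remove(path)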

(Do you have more questions? If so, please make sure they end with a ? or I am likely not to realize that they are questions. :) )

comment:24 Changed 7 months ago by Sebastian

So, we just got rid of the naming thing because it was an extra thing to run, and only 2 out of 9 dirauth ops could be bothered to set it up. Similarly, only 5 of 9 can be bothered to set up bwauths. This makes me sad, but I worry about features like this that add extra code which doesn't get auto-installed with the dirauth. Maybe we need to kick our dirauth ops more, too - though I tried that recently and it took a long time as well.

For the keeping around of consensuses and the deletion stuff, mvdan has just written all of this inside Tor. Don't write it again; improve his stuff with options if you need to :)

How much space are we talking about, btw? I would estimate around 10GB. I suppose that should be manageable for all dirauths, but I don't know.

On another note, I'm not sure how valuable making the time you keep consensuses configurable would be - what are the dirauth interactions here if they have different timespans?

comment:25 Changed 7 months ago by Sebastian

Oh. Forgot to say something. The bwauth stuff I run is horribly unmaintained. How will we make sure this new thing won't be the same? It's problematic if features completely vital to the Tor network are implemented outside of the Tor repository, where people take care of the code.

That said, personally I would deem it acceptable and in line with my dirauth ops duties to run this.

comment:26 Changed 7 months ago by asn

FWIW, you can see my python script in the guardiness branch of https://git.torproject.org/user/asn/hax.git. You need stem to run it, some (incomplete) instructions can be found in the README.

comment:27 Changed 6 months ago by asn

  • Status changed from new to needs_review

I published an initial cleaned up version of my little-t-tor branch at bug9321_draft in https://git.torproject.org/user/asn/tor.git.

It has a few XXXs on which I would like feedback from reviewers. I also need to tone down the logging severities; I'm currently using this branch in chutney and I'm looking at those logs.

Feedback is very welcome.

comment:28 follow-up: Changed 6 months ago by nickm

Notes on file format:

  • A bad line in the file should probably get ignored, not cause the whole file to be unparseable.
  • Errors while parsing the file should probably get reported by line number.
  • Right now, applying the file to your vote is O(file_size * log (n_elts_in_vote)). As the file gets bigger and bigger, this will take more and more time. I wonder whether it matters.

Notes on guardfraction voting:

  • Shouldn't we include GuardFraction in consensus votes for all nodes, regardless of whether we think they're a Guard? After all, other authorities might decide to vote on whether the node should be a Guard.
  • Why are guardfraction_percentage and its related flag duplicated in routerstatus_t and vote_routerstatus_t? Remember, vote_routerstatus_t contains a routerstatus_t.
  • In routerstatus_parse_guardfraction, I'd be more comfortable if we checked the return value of strchr.

Notes on bw calculation:

  • Maybe guard_get_guardfraction_bandwidth should fill in a structure rather than allocating one; it's going to get called a lot.

I need to go back and look at the bandwidth formulas; I didn't check them this time around.

On XXXs:

  • I don't think floor/ceiling matters.
  • It's okay not to assert for that invariant.
  • I don't know about applying guardfraction to weight_for_dir. Does that mean it would apply to choice of directory guards or not?
  • is_possible_guard is not quite equivalent to is_guard; what did you mean there?
  • IMO it's fine to ignore smartlist_choose_node_by_bandwidth for now.

Notes on tests:

  • test_helpers.h needs an #ifdef guard.

comment:29 in reply to: ↑ 28 Changed 5 months ago by asn

Replying to nickm:

Thanks for the review. I pushed some more commits that fix your comments.
Here are some comments on your comments:

Notes on file format:

  • A bad line in the file should probably get ignored, not cause the whole file to be unparseable.

Good point. Done.

I changed it so that errors on individual lines will be ignored. However, the first header line (the one containing the date) needs to be correct. I don't see any reason for omitting that line in future versions of the script, and it also allows us to detect expired guardfraction files (which in turn allows us to find bugs in the guardfraction script on dirauths).

To achieve this, I had to change the interface of the parsing function and its unittests.

  • Errors while parsing the file should probably get reported by line number.

Good point. Done.

  • Right now, applying the file to your vote is O(file_size * log (n_elts_in_vote)). As the file gets bigger and bigger, this will take more and more time. I wonder whether it matters.

Yes, I've been wondering about that too. It seems quite fast atm, and the file size should not increase very fast (except if 100k guards enter the Tor network). Let's leave it like this, and maybe optimize it in the future if there is a need to? (famous last words?)

Notes on guardfraction voting:

  • Shouldn't we include GuardFraction in consensus votes for all nodes, regardless of whether we think they're a Guard? After all, other authorities might decide to vote on whether the node should be a Guard.

Yes, you are very right. Thanks for catching this. Done.

This also kills some code, which is even better :)

  • Why are guardfraction_percentage and its related flag duplicated in routerstatus_t and vote_routerstatus_t? Remember, vote_routerstatus_t contains a routerstatus_t.

Hm, that's how the measured bw file parsing code worked, and I duplicated the logic.
In my new commits, I started using the routerstatus_t of the vote to save this info, and it seems to work fine in my Chutney network. I don't see any reason yet for this to be bad.

This makes the code cleaner.

  • In routerstatus_parse_guardfraction, I'd be more comfortable if we checked the return value of strchr.

Done.

Notes on bw calculation:

  • Maybe guard_get_guardfraction_bandwidth should fill in a structure rather than allocating one; it's going to get called a lot.

Good point. Done.

BTW, please check update_total_bandwidth_weights(). This function will be called for each router, so it will be called a lot. For now, I placed the guardfraction_bandwidth_t on the stack even though the function will be called many times. I think that gives better performance than mallocing it every time the function is called (because it just pushes the stack a few more bytes down in the function prologue). Is that right?

If not, I can pass the guardfraction_bandwidth_t as an argument to that function.

I need to go back and look at the bandwidth formulas; I didn't check them this time around.

On XXXs:

  • I don't think floor/ceiling matters.
  • It's okay not to assert for that invariant.

OK.

  • I don't know about applying guardfraction to weight_for_dir. Does that mean it would apply to choice of directory guards or not?

Yeah, I was a bit confused about this. Proposal 236 only suggested using guardfraction info when picking middles/exits. However, I think considering guardfraction info for directory requests too is a good idea, so that we load balance those as well.

As far as directory guards are concerned, since they are sharing the same list as the circuit guards, I suspect that most of the nodes in that list will have been picked by the circuit path selection algorithm, so it doesn't really matter. I think.

For now, I'm checking that rule != WEIGHT_FOR_GUARD which includes picks for middle/exit/directory and also NO_WEIGHTING which seems to be unused in the code.

  • is_possible_guard is not quite equivalent to is_guard; what did you mean there?

Hm, I wanted to assert that the node we are applying guardfraction info to is a guard. This should be true because routerstatus_parse_guardfraction() only sets guardfraction info on guards.

To do that, I would have to assert for is_guard in that function. However, is_guard = node->is_possible_guard which is fine in general, but it causes assert failures in directory authorities because the node_t is not initialized properly (#13297). Hence, I checked for node->rs->is_possible_guard which I think should be the same since the rs is assigned in nodelist_set_consensus() and then it does node->is_possible_guard = rs->is_possible_guard.

:/

  • IMO it's fine to ignore smartlist_choose_node_by_bandwidth for now.

OK.

Notes on tests:

  • test_helpers.h needs an #ifdef guard.

Done.

comment:30 follow-up: Changed 5 months ago by asn

Discussed this a bit more with Nick on IRC.

We want to add two more things:
a) Have a consensus parameter variable that turns on/off whether clients should look at guardfraction. We will start that as 'off' and turn it to 'on' when the code has been deployed to more clients.

b) Add a torrc option UseGuardFraction that can be on/off/auto. If it's auto, clients look at the consensus parameter. This will allow individual users to test the feature before we deploy it everywhere.

We are hoping for a merge during November or early/mid December.

comment:31 in reply to: ↑ 30 Changed 4 months ago by asn

Replying to asn:

Discussed this a bit more with Nick on IRC.

We want to add two more things:
a) Have a consensus parameter variable that turns on/off whether clients should look at guardfraction. We will start that as 'off' and turn it to 'on' when the code has been deployed to more clients.

b) Add a torrc option UseGuardFraction that can be on/off/auto. If it's auto, clients look at the consensus parameter. This will allow individual users to test the feature before we deploy it everywhere.

We are hoping for a merge during November or early/mid December.

OK. Please look at the bug9321_draft branch again. I implemented the above concepts and wrote some unittests. Please review :)

FWIW, if the UseGuardFraction consensus/torrc parameter is disabled, we are not going to apply any guardfraction information found in consensuses.
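
The on/off/auto semantics described above, as a short illustrative sketch (hypothetical Python; the real decision is made in the C code by should_apply_guardfraction()):

def use_guardfraction(torrc_value, consensus_param):
    # torrc_value is "on", "off", or "auto"; consensus_param is the
    # boolean UseGuardFraction parameter taken from the consensus.
    if torrc_value == "auto":
        return consensus_param
    return torrc_value == "on"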

comment:32 Changed 3 months ago by teor

Review of git diff 390ec9154c58d6c7acaecd5e575adf575ab2f1fe^ asn/bug9321_draft by reading through the modified code:

Impacts on other changes

If dirserv_read_guardfraction_file can handle reading an empty (zero-length) file, and we merge #13111 in 0.2.6 as well, this code:

+      if (file_status(options->GuardfractionFile) != FN_FILE) {
+        REJECT("GuardfractionFile set but not a file? Failing");
+      }
+
+      dirserv_read_guardfraction_file(options->GuardfractionFile, NULL);

will need to be updated to:

+      if (file_status(options->GuardfractionFile) != FN_FILE && file_status(options->GuardfractionFile) != FN_EMPTY) {
+        REJECT("GuardfractionFile set but not a file? Failing");
+      }
+
+      dirserv_read_guardfraction_file(options->GuardfractionFile, NULL);

Or maybe we should just fail if the guardfraction file is truncated to zero-length?

Minor Picky Things

Log Severities:

I'd downgrade log_warn(LD_GENERAL, "%s: Guardfraction weight %f instead of %f (%s)", to a log_debug, given it may be called several thousand times.

Integer ranges:

I think this code could cause an integer overflow on insane inputs when SIZEOF_LONG == SIZEOF_INT:

version =
+        (unsigned int) tor_parse_long(line->value, 10, 0, UINT_MAX, &num_ok, NULL);

The tor_parse_long function prototype is:

long
tor_parse_long(const char *s, int base, long min, long max,
               int *ok, char **next)

I suggest we use tor_parse_long(line->value, 10, 0, INT_MAX, ...).

Capitalisation:

  • UseGuardFraction vs GuardfractionFile vs Guardfraction: in log messages
    • I suggest GuardFractionFile and GuardFraction: (all Fs capitalised), but I don't care much either way about Guard[Ff]raction:

comment:33 Changed 3 months ago by teor

When I merge with 0.2.6.2 master, I get the following conflict:

<<<<<<< HEAD
    uint32_t *bandwidths_kb = tor_calloc(smartlist_len(votes),
                                         sizeof(uint32_t));
    uint32_t *measured_bws_kb = tor_calloc(smartlist_len(votes),
                                           sizeof(uint32_t));
=======
    uint32_t *bandwidths_kb = tor_calloc(sizeof(uint32_t),
                                         smartlist_len(votes));
    uint32_t *measured_bws_kb = tor_calloc(sizeof(uint32_t),
                                           smartlist_len(votes));
    uint32_t *measured_guardfraction = tor_calloc(sizeof(uint32_t),
                                               smartlist_len(votes));
>>>>>>> asn/bug9321_draft

The tor_calloc_ prototype is:

void *
tor_calloc_(size_t nmemb, size_t size)

We can resolve it by changing each tor_calloc call to use nmemb then size:

    uint32_t *bandwidths_kb = tor_calloc(smartlist_len(votes),
                                         sizeof(uint32_t));
    uint32_t *measured_bws_kb = tor_calloc(smartlist_len(votes),
                                           sizeof(uint32_t));
    uint32_t *measured_guardfraction = tor_calloc(smartlist_len(votes),
                                                  sizeof(uint32_t));

comment:34 follow-up: Changed 3 months ago by teor

I have the following unit tests fail on an OS X 10.9.5 i386 build, but not on the x86_64 build:

guardfraction/parse_guardfraction_file_bad: [forking] 
  FAIL src/test/test_guardfraction.c:116: assert(retval == 2): -1 vs 2
  [parse_guardfraction_file_bad FAILED]
guardfraction/parse_guardfraction_file_good: [forking] 
  FAIL src/test/test_guardfraction.c:218: assert(retval == 1): -1 vs 1
  [parse_guardfraction_file_good FAILED]

Anything in particular I should look for to diagnose, asn?

comment:35 Changed 3 months ago by teor

#13111 now has unit tests, so it's likely that the changes I mention above to dirserv_read_guardfraction_file in comment 32 will need to be made when we merge this branch in.

comment:36 in reply to: ↑ 34 Changed 2 months ago by asn

Replying to teor:

I have the following unit tests fail on an OS X 10.9.5 i386 build, but not on the x86_64 build:

guardfraction/parse_guardfraction_file_bad: [forking] 
  FAIL src/test/test_guardfraction.c:116: assert(retval == 2): -1 vs 2
  [parse_guardfraction_file_bad FAILED]
guardfraction/parse_guardfraction_file_good: [forking] 
  FAIL src/test/test_guardfraction.c:218: assert(retval == 1): -1 vs 1
  [parse_guardfraction_file_good FAILED]

Anything in particular I should look for to diagnose, asn?

Yes, it would be great if you could diagnose this.

What I would do is go to dirserv_read_guardfraction_file_from_str() and add some printfs in the places where it can return -1. IIRC, those are all the places where goto done; is called. Then run the tests again and see which printf triggers, to find what causes the error.

IIRC, -1 can be returned if config_get_lines() fails, or if the date was wrong, or if the version was wrong. I'm wondering which case applies here on i386. (If the version check is what fails, then maybe you want to fix the overflow you mentioned above.)

Thanks!

Last edited 2 months ago by asn

comment:37 Changed 2 months ago by isis

  • Cc isis added

comment:38 Changed 2 months ago by nickm

  • Keywords unfrozen added

comment:39 Changed 8 weeks ago by asn

OK!

I addressed teor's comments and rebased the branch onto the latest master. You can find the new branch at bug9321_rebase in my repo. This is now ready for final review.

I'm confident that fixing the overflow teor mentioned also fixes the unit test failure on 32-bit machines. David ran the tests on an i386 box and they passed.

BTW, in this new branch I have rebased and squashed the various fix commits so that they look nicer and are easier to review. If you still prefer the old commit format, check out bug9321_rebase2, which is only rebased and not squashed.

comment:40 Changed 8 weeks ago by nickm

  • Priority changed from normal to major

comment:41 Changed 7 weeks ago by nickm

  • Keywords nickm-review added

comment:42 Changed 7 weeks ago by asn

BTW, there is now this code in options_validate():

    /* same for guardfraction file */
    if (options->GuardfractionFile && !old_options) {
      file_status_t fs = file_status(options->GuardfractionFile);
      if (fs == FN_EMPTY) {
        REJECT("GuardfractionFile set but it's an empty file? Failing");
      } else if (fs != FN_FILE) {
        REJECT("GuardfractionFile set but not a file? Failing");
      }

      dirserv_read_guardfraction_file(options->GuardfractionFile, NULL);
    }
  }

Nick, do you think we should be less strict and not REJECT if we can't find the file?
We could just issue a warning and proceed with our business, without considering guard fraction at all.

This seems to be the behavior of the V3BandwidthsFile option, and it might be better because Tor won't refuse to start if the guardfraction script is misbehaving and doesn't output a file.

comment:43 follow-up: Changed 7 weeks ago by nickm

Sure, that behavior (warn and continue) seems fine.

comment:44 in reply to: ↑ 43 Changed 7 weeks ago by asn

Replying to nickm:

Sure, that behavior (warn and continue) seems fine.

ACK. Implemented and pushed in bug9321_rebase.
I also tested it and it seems to work.

comment:45 follow-up: Changed 6 weeks ago by teor

Apologies for my absence - life has been very chaotic for a few weeks.

Reviewing asn/bug9321_rebase:
Merging with b101f4e98ce811aee729c31f62ec5dd1cfe44e85 from tor git

CONFLICT (content): Merge conflict in src/or/dirvote.h

Resolved conflict by including both #defines:

<<<<<<< HEAD
/** Lowest consensus method where we include "package" lines*/
#define MIN_METHOD_FOR_PACKAGE_LINES 19
=======
/** Lowest consensus method where authorities may include
 * GuardFraction information in microdescriptors. */
#define MIN_METHOD_FOR_GUARDFRACTION 19
>>>>>>> asn/bug9321_rebase

Is this the right approach?
(Running tests now...)

comment:46 in reply to: ↑ 45 Changed 6 weeks ago by asn

Replying to teor:

Apologies for my absence - life has been very chaotic for a few weeks.

Reviewing asn/bug9321_rebase:
Merging with b101f4e98ce811aee729c31f62ec5dd1cfe44e85 from tor git

CONFLICT (content): Merge conflict in src/or/dirvote.h

Resolved conflict by including both #defines:

<<<<<<< HEAD
/** Lowest consensus method where we include "package" lines*/
#define MIN_METHOD_FOR_PACKAGE_LINES 19
=======
/** Lowest consensus method where authorities may include
 * GuardFraction information in microdescriptors. */
#define MIN_METHOD_FOR_GUARDFRACTION 19
>>>>>>> asn/bug9321_rebase

Is this the right approach?
(Running tests now...)

Oh more conflicts...

The correct approach here, I think, is to bump MIN_METHOD_FOR_GUARDFRACTION to 20 so that they don't conflict.

comment:47 Changed 6 weeks ago by teor

I left out the context of the conflict; the correct bump is 21:

/** Lowest consensus method where we include "package" lines*/
#define MIN_METHOD_FOR_PACKAGE_LINES 19

/** Default bandwidth to clip unmeasured bandwidths to using method >=
 * MIN_METHOD_TO_CLIP_UNMEASURED_BW */
#define DEFAULT_MAX_UNMEASURED_BW_KB 20

/** Lowest consensus method where authorities may include
 * GuardFraction information in microdescriptors. */
#define MIN_METHOD_FOR_GUARDFRACTION 21

comment:48 Changed 6 weeks ago by teor

Unit tests have run fine for me on several variations of configure options on OS X 10.10.2 i386 (and x86_64).

As far as I can tell, it was the overflow that was causing unit test failures for me on i386.
This may be because I build with clang's undefined behavior sanitizer on.
(It makes life more interesting...)

comment:49 Changed 5 weeks ago by nickm

8f4563048534cddcc67e34de76a3935dfdaaf9c6

  • document in guardfraction_line_apply that vote_routerstatuses must be sorted.

ef3bec7f62c901ec62445703a16177b49df397e1

  • enougn -> enough

7b6eed0320b4955d9b252adb2833c7cbabf6523a

  • This is going to sound a bit silly, but I think that we should be using lround rather than cast-to-int to convert the float to int in guard_get_guardfraction_bandwidth, and we should be using subtraction to get non_guard_bw, so that non_guard_bw + guard_bw == orig_bandwidth (see the sketch at the end of this comment)

769246b014f98b146a636acd06febced1fc72940

  • In update_total_bandwidth_weights(), I am not 100% sure that the comment matches the code. Are you? For example, I don't see anything in the comment about *M += default_bandwidth.

2eaa9b7c778d8314dbe341b56da67f9a641908ee:

  • An empty or missing file should not keep us from checking the file again later, should it?
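
The lround/subtraction point above in miniature, as an illustrative Python sketch of the invariant (the real code is C and uses lround):

def split_guardfraction_bandwidth(orig_bandwidth, guardfraction_percent):
    # Round the guard share, then derive the non-guard share by subtraction
    # so that guard_bw + non_guard_bw == orig_bandwidth always holds.
    guard_bw = round(orig_bandwidth * guardfraction_percent / 100.0)
    non_guard_bw = orig_bandwidth - guard_bw
    return guard_bw, non_guard_bw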

comment:50 Changed 5 weeks ago by nickm

961caad304b752bbe659bd3b4f751a62b6b4094b:

  • It's a little weird that should_apply_guardfraction() applies changes only when we next parse the networkstatus, but that probably won't hurt anything, right?

comment:51 Changed 5 weeks ago by nickm

I've addressed all comments except for the one pertaining to 769246b014 and the one pertaining to 961caad in my branch "bug9321_rebase_nm" in my public repository. Needs review.

comment:52 follow-up: Changed 5 weeks ago by asn

OK, please check the bug9321_rebase branch again. I improved the comment wrt the total bandwidth weights. Hopefully the situation is more clear now.

Your comment:

It's a little weird that should_apply_guardfraction() applies changes only when we next parse the networkstatus, but that probably won't hurt anything, right? 

is interesting. You mean that because we call should_apply_guardfraction() during consensus parsing, networkstatus_get_param() will actually check the previous consensus and not the one we are currently parsing, right? That's true.

I don't think it's tera-bad, except if we plan to be toggling the UseGuardFraction consensus parameter frequently. Under normal circumstances, it will delay the effect of UseGuardFraction by one hour.

However, I understand that it's wrong behavior and it should be fixed. I can see two ways here:

We can try to pass the consensus that is currently getting parsed to should_apply_guardfraction(). This might be OK even though that networkstatus_t is not completely populated, since in networkstatus_parse_vote_from_string() we set net_params before we call routerstatus_parse_entry_from_string().

The other way would be to pass the net_params smartlist to the should_apply_guardfraction() function and let it use the smartlist directly instead of calling networkstatus_get_param(). This is messier and I don't like it.

comment:53 in reply to: ↑ 52 ; follow-up: Changed 5 weeks ago by nickm

Replying to asn:

OK, please check the bug9321_rebase branch again. I improved the comment wrt the total bandwidth weights. Hopefully the situation is more clear now.

Your comment:

It's a little weird that should_apply_guardfraction() applies changes only when we next parse the networkstatus, but that probably won't hurt anything, right? 

is interesting. You mean that because we call should_apply_guardfraction() during consensus parsing, networkstatus_get_param() will actually check the previous consensus and not the one we are currently parsing, right? That's true.

I don't think it's tera-bad, except if we plan to be toggling the UseGuardFraction consensus parameter frequently. Under normal circumstances, it will delay the effect of UseGuardFraction by one hour.

I agree that it is wrong but not so bad. Either approach is fine, though the former needs lots of documentation so that the relevant functions are explicit that they accept a partially constructed consensus.

Could you please make a new ticket for this, targeting 0.2.7? I'm going to merge what's here now and close.

comment:54 Changed 5 weeks ago by nickm

  • Resolution set to implemented
  • Status changed from needs_review to closed

merged; please make a new ticket for the should_apply_guardfraction() issue?

comment:55 in reply to: ↑ 53 Changed 5 weeks ago by asn

Replying to nickm:

Replying to asn:

OK, please check the bug9321_rebase branch again. I improved the comment wrt the total bandwidth weights. Hopefully the situation is more clear now.

Your comment:

It's a little weird that should_apply_guardfraction() applies changes only when we next parse the networkstatus, but that probably won't hurt anything, right? 

is interesting. You mean that because we call should_apply_guardfraction() during consensus parsing, networkstatus_get_param() will actually check the previous consensus and not the one we are currently parsing, right? That's true.

I don't think it's tera-bad, except if we plan to be toggling the UseGuardFraction consensus parameter frequently. Under normal circumstances, it will delay the effect of UseGuardFraction by one hour.

I agree that it is wrong but not so bad. Either approach is fine, though the former needs lots of documentation so that the relevant functions are explicit that they accept a partially constructed consensus.

Could you please make a new ticket for this, targeting 0.2.7? I'm going to merge what's here now and close.

Thanks for the merge.

Opened ticket at #14957.
