Opened 8 years ago

Last modified 8 months ago

#2550 new enhancement

bwauth should reschedule quicker bandwidth test when bandwidthrate changes?

Reported by: arma
Owned by: mikeperry
Priority: Medium
Milestone:
Component: Core Tor/Torflow
Version:
Severity: Normal
Keywords:
Cc: aagbsn@…, juga
Actual Points:
Parent ID: #13630
Points: ?
Reviewer:
Sponsor:

Description

https://metrics.torproject.org/relay-search.html?search=AEIOUm+2011-02-13

He apparently switches his bandwidthrate between 100KB/s and 2MB/s depending on the time of day. His bwauth votes ended up being very skewed:

moria1 says
w Bandwidth=141 Measured=15

ides says
w Bandwidth=141 Measured=26

urras says
w Bandwidth=141 Measured=1480

gabelmoo says
w Bandwidth=141 Measured=935

It's a shame that we're giving really low numbers to a node that wants to be 2MB/s at some times of day. It probably also means we'd give high numbers to a mostly-slow node if the measurements happened to land during its fast period.

Child Tickets

Change History (16)

comment:1 Changed 8 years ago by mikeperry

So which value do we want to give him? We can only measure him about once a day at best, but probably more like every two or three days. So we basically have to decide when to actually measure relays like this.

If we measure him during his burst period and give him the fast value, clients who use him when his capacity is lower will be crushed and have poor access. If we measure him during his slow period, he will be a faster relay to use at some points, but never a crushingly poor relay to use.

The second option seems best to me, but how do we even detect this pattern? It sounds like we need to do some sort of analysis on descriptors over time to even figure this out. A single descriptor won't even tell us this, right?

comment:2 Changed 8 years ago by mikeperry

Points: ?

comment:3 Changed 8 years ago by arma

A single descriptor won't tell us, correct. But two descriptors will. If we remember the bandwidthrate for each relay we've measured, then when we see a descriptor with a different rate, we can know to handle it specially.

Option 1 would be to discard our measurements if the rate changes. But that approach would exacerbate #2286.

Option 2 would be to scale our measurement by the ratio of the bandwidths. But maybe not raise it past a certain ratio.

Option 3 would be to remember n measurements for n bandwidthrates. Might not be worth the extra coding hassle.

You say "We can only measure him about once every day at best, but probably more like every two or three". Can't we just schedule a couple of quick tests right then when we see the bandwidthrate has changed? (The results of these tests could go along with options 1, 2, or 3 above.)
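To make the bookkeeping concrete, here is a minimal sketch of what option 2 plus the quick re-test could look like. Everything in it (on_new_descriptor, schedule_quick_test, the MAX_SCALE_RATIO cap) is hypothetical and not existing torflow code:

{{{#!python
# Hypothetical sketch of option 2 plus a quick re-test on rate change.
MAX_SCALE_RATIO = 3.0   # assumed cap on how far a rate change may scale a result

last_rate = {}      # fingerprint -> bandwidthrate we last measured against
last_measured = {}  # fingerprint -> measured value we last voted

def on_new_descriptor(fp, new_rate, schedule_quick_test):
    old_rate = last_rate.get(fp)
    if old_rate and new_rate != old_rate and fp in last_measured:
        # Scale the old measurement by the ratio of new to old capacity,
        # but never by more than MAX_SCALE_RATIO in either direction.
        ratio = new_rate / float(old_rate)
        ratio = max(1.0 / MAX_SCALE_RATIO, min(MAX_SCALE_RATIO, ratio))
        last_measured[fp] = int(last_measured[fp] * ratio)
        # ...and schedule a couple of quick tests right away.
        schedule_quick_test(fp)
    last_rate[fp] = new_rate
}}}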

comment:4 Changed 8 years ago by mikeperry

Option 1 might work. We might be able to add logic to schedule a quick test as soon as we notice that a descriptor's advertised bandwidth values have changed significantly. This could create a situation where a relay could cause the bw scanners to make no progress, though, if we reset what we have every time we do this. But if we don't discard, we'll have inaccurate results.

Option 2 is technically what we already do now: We use the previous consensus descriptor value to multiply the measured ratio against. It is possible this is not optimal because we don't fetch often enough, and need to set torrc options FetchDirInfoExtraEarly and FetchDirInfoEarly.

It is also possible that we are actually measuring a low ratio on some bw auths and a high ratio on others, just based on when we measure. Depending on if the relay is actually rate limiting, or just altering MaxAdvertisedBandwidth, the ratio we get could be either super low while the relay reports a low value, or super high...

Option 3 is unlikely to work out if the bandwidthrate change is too frequent.. We are not likely to gather measurements for all values.
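For reference, the ratio computation that option 2 refers to boils down to roughly this (a simplified paraphrase of what aggregate.py does; the names are illustrative):

{{{#!python
def measured_value(adv_bw, relay_avg_stream_bw, network_avg_stream_bw):
    """Scale the advertised bandwidth of our choice by the ratio of this
    relay's average measured stream bandwidth to the network-wide average."""
    ratio = relay_avg_stream_bw / float(network_avg_stream_bw)
    return int(adv_bw * ratio)
}}}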

comment:5 Changed 8 years ago by aagbsn

Cc: aagbsn@… added

comment:6 Changed 8 years ago by mikeperry

Option 3a is to remember the n most recent advertised values and take the lowest of those for use in the ratio computation. This option could be combined with option 1 to reduce the number n of values we need to remember (which would be a function of option 1's expected measurement-to-client turnaround, or about 4-6 hours).

comment:7 Changed 8 years ago by mikeperry

To be on the safe side, it seems like a reasonable n corresponds to 24 hours worth of measurements, because the relays that do throttle will likely do it on a 24-hour cycle. Also, since advertised bw is normally the max observed over a 24-hour window, this shouldn't change things much for normal relays.
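A minimal sketch of option 3a as described above, assuming an in-memory structure for illustration (the real data would come from the BwHistory table, and none of these names exist in the code):

{{{#!python
import time

WINDOW_SECS = 24 * 60 * 60   # the 24-hour window discussed above

adv_history = {}             # fingerprint -> list of (timestamp, advertised_bw)

def record_advertised(fp, adv_bw, now=None):
    now = now if now is not None else time.time()
    hist = adv_history.setdefault(fp, [])
    hist.append((now, adv_bw))
    # Keep only the entries that fall inside the rolling window.
    adv_history[fp] = [(t, bw) for (t, bw) in hist if now - t <= WINDOW_SECS]

def min_advertised(fp):
    hist = adv_history.get(fp)
    return min(bw for _, bw in hist) if hist else None
}}}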

comment:8 Changed 8 years ago by arma

Replying to mikeperry:

Option 2 is technically what we already do now: We use the previous consensus descriptor value to multiply the measured ratio against. It is possible this is not optimal because we don't fetch often enough, and need to set torrc options FetchDirInfoExtraEarly and FetchDirInfoEarly.

It is also possible that we are actually measuring a low ratio on some bw auths and a high ratio on others, just based on when we measure. Depending on if the relay is actually rate limiting, or just altering MaxAdvertisedBandwidth, the ratio we get could be either super low while the relay reports a low value, or super high...

I meant something different by 'the ratio' than I think you do.

I meant the ratio of that relay's new capacity to its old capacity.

I think you mean the ratio of the performance we see from that relay compared to its peers (who advertise the same capacity).

So in my option 2, if the relay moves from 1000MB-but-we'd-advertise-3000MB to 100KB, then we advertise 300KB. It's quite a hack I admit.

Option 3 is unlikely to work out if the bandwidthrate change is too frequent.. We are not likely to gather measurements for all values.

No, option 3 might work ok if the change is frequent. It will fail if the bandwidthrate changes to too many different values.

comment:9 in reply to: 8; Changed 8 years ago by mikeperry

Replying to arma:

Replying to mikeperry:

Option 2 is technically what we already do now: We use the previous consensus descriptor value to multiply the measured ratio against. It is possible this is not optimal because we don't fetch often enough, and need to set torrc options FetchDirInfoExtraEarly and FetchDirInfoEarly.

It is also possible that we are actually measuring a low ratio on some bw auths and a high ratio on others, just based on when we measure. Depending on if the relay is actually rate limiting, or just altering MaxAdvertisedBandwidth, the ratio we get could be either super low while the relay reports a low value, or super high...

I meant something different by 'the ratio' than I think you do.

I meant the ratio of that relay's new capacity to its old capacity.

I think you mean the ratio of the performance we see from that relay compared to its peers (who advertise the same capacity).

Yes, and this value is already multiplied by the advertised bandwidth of our choice. So we are already doing option 2; we just need to decide which adv value to use. Hence the rolling window thing, which I will describe better in a followup comment.

So in my option 2, if the relay moves from 1000MB-but-we'd-advertise-3000MB to 100KB, then we advertise 300KB. It's quite a hack I admit.

Yes. Because we're dealing with ratios here internally, this is already what happens :)

Option 3 is unlikely to work out if the bandwidthrate change is too frequent.. We are not likely to gather measurements for all values.

No, option 3 might work ok if the change is frequent. It will fail if the bandwidthrate changes to too many different values.

We don't measure frequently enough for this to mean anything. Hence, I proposed 3a.

comment:10 Changed 8 years ago by mikeperry

Ok, I am going to attempt to clearly and accurately combine options 1, 2 (which is done already), and 3a into "The Plan, v1".

First (option 3a), we ensure we have a rolling window of at least 24 hours' worth of advertised bandwidths for a relay. (This info is maintained in the BwHistory table, but we need a way to get it into aggregate.py.)

Whenever we compute a measured value, we use the minimum value of this 24 hour window as the advertised bandwidth. Again, this advertised value is what is multiplied by the ratio of the relay's average stream bw to the network-wide average stream bw (in aggregate.py).

Now (option 1), we alter bwauthority.py to discard its results every time it notices a new minimum. This improves the turnaround time to measure a ratio for the node as it behaves while it is being crushed by clients due to the throttle change. Thus, we should have a double-minimum for these relays. They would be super fun to use at all times other than when they are being crushed, and like normal relays during the "crush" period.

Again, option 1 would need some additional smarts to ensure that it doesn't discard results too often; otherwise the bw auth would make no progress. Perhaps these smarts are based on how many measurements it managed to get since the last discard, used to estimate whether the measurement frequency is high enough to make progress. Or perhaps these smarts are simpler (i.e. "do not discard more often than 1x per 24 hours").

In summary, option 3a allows us to report one minimum, while option 1 allows us to actually measure the severity of this minimum due to clients *also* not responding fast enough to the change.

I suppose these two still could be done independently. Perhaps they should in fact be separate child tickets.
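A hypothetical sketch of the discard rule above, using the simpler rate-limiting smarts ("do not discard more often than 1x per 24 hours"); none of these names exist in bwauthority.py:

{{{#!python
import time

DISCARD_INTERVAL = 24 * 60 * 60   # at most one discard per relay per day

window_min = {}     # fingerprint -> current 24-hour minimum advertised bw
last_discard = {}   # fingerprint -> time of the last result discard

def maybe_discard(fp, new_adv_bw, discard_stream_results, now=None):
    """Discard a relay's stream measurements when its advertised bandwidth
    falls below the minimum we have on record, rate-limited per relay."""
    now = now if now is not None else time.time()
    prev_min = window_min.get(fp)
    if prev_min is None or new_adv_bw < prev_min:
        window_min[fp] = new_adv_bw
        if prev_min is not None and now - last_discard.get(fp, 0) >= DISCARD_INTERVAL:
            discard_stream_results(fp)
            last_discard[fp] = now
}}}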

comment:11 Changed 8 years ago by mikeperry

Implementation detail: discarding results automatically causes bwauthority.py to remeasure the relays with the fewest measurements, due to the use of the ExactUniformGenerator in TorCtl/PathSupport.py to track the result count. Because of this, relays with discarded stream measurement values should be measured rapidly, before the completion of a slice, so no actual reported values would ever be omitted. (Assuming the code updates the counter in ExactUniformGenerator while also discarding the measured stream bw values :).
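In other words, the discard step would have to touch both pieces of state, roughly like this (a sketch only; the attribute and method names are made up, and the real counter API is in TorCtl/PathSupport.py):

{{{#!python
def discard_relay_results(router, generator):
    # Throw away the stream bandwidth samples collected so far for this relay...
    router.bw_measurements = []        # illustrative attribute name
    # ...and reset its count in the ExactUniformGenerator so the relay is
    # treated as least-measured and gets re-selected before the slice ends.
    generator.reset_count(router)      # hypothetical method, not the real API
}}}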

comment:12 in reply to: 9; Changed 8 years ago by rransom

Replying to mikeperry:

Replying to arma:

So in my option 2, if the relay moves from 1000MB-but-we'd-advertise-3000MB to 100KB, then we advertise 300KB. It's quite a hack I admit.

Yes. Because we're dealing with ratios here internally, this is already what happens :)

So if I run a few exits that advertise 20kiB/s but actually push traffic at 150MiB/s, and then I make them start advertising 1000MiB/s, how much of the Tor network's exit traffic can I sniff?

comment:13 in reply to: 12; Changed 8 years ago by mikeperry

Replying to rransom:

Replying to mikeperry:

Replying to arma:

So in my option 2, if the relay moves from 1000MB-but-we'd-advertise-3000MB to 100KB, then we advertise 300KB. It's quite a hack I admit.

Yes. Because we're dealing with ratios here internally, this is already what happens :)

So if I run a few exits that advertise 20kiB/s but actually push traffic at 150MiB/s, and then I make them start advertising 1000MiB/s, how much of the Tor network's exit traffic can I sniff?

In both cases (before and after these changes) there will be a period of time where you capture a ton of traffic. Now, the bw authorities already cap each node instance to a max of 5% of the total network capacity, so there is a ceiling on how much you can capture this way, even for a period of time.

Before these changes, the period of time that you'd attract traffic is arbitrary, with a range of 0-5 days.

With these improvements, if we're using the minimum and we schedule a re-measurement when we notice the new advertised value of 1000MiB, it is likely that clients have not started using you yet, so we'd still measure you as fast until we decide to measure you again. Right now, there is no plan to schedule this sooner, so it would probably be until the entire network was scanned and we started again. When we implement feedback (#1976), we can try to think about how to better handle this re-measurement period. But I'm not sure exactly how to detect this.
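For reference, the 5% cap mentioned above amounts to something like this (a simplification; the actual clipping happens in aggregate.py and the names are illustrative):

{{{#!python
MAX_NETWORK_FRACTION = 0.05   # no single relay instance gets more than 5% of the total

def cap_measured(measured_bw, total_network_bw):
    return min(measured_bw, int(MAX_NETWORK_FRACTION * total_network_bw))
}}}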

However, also realize the bw auths are not meant to be a strong defence against lying or gaming, so there are other attacks here, too. They are primarily a performance tool, with only minimal defences against gaming. For stronger defence against lying/gaming, we'd need something like EigenSpeed, but EigenSpeed has the exact opposite characteristics. It is very good against liars and gamers (compared to the bw auths, at least), but it sucks when it comes to properly measuring the higher capacity nodes:
http://www.usenix.org/event/iptps09/tech/full_papers/snader/snader.pdf

Nikita supposedly has some grad students looking at improving EigenSpeed's high-end measurement properties and/or combining it with active measurements. The status of this is unknown, though.

comment:14 Changed 14 months ago by teor

Parent ID: #13630
Severity: Blocker

This is a feature that belongs in the new bwauth replacement project, see #13630.

comment:15 Changed 14 months ago by teor

Severity: Blocker → Normal

Priorities and Severities in torflow are meaningless, setting them all to Medium/Normal.

comment:16 Changed 8 months ago by juga

Cc: juga added