Opened 6 months ago

Last modified 5 months ago

#33871 needs_review defect

Scale exactly as torflow does?

Reported by: juga Owned by: juga
Priority: Medium Milestone: sbws: 1.1.x-final
Component: Core Tor/sbws Version:
Severity: Normal Keywords: sbws-majority-blocker, sbws-roadmap
Cc: juga, gk, mikeperry Actual Points:
Parent ID: #33775 Points:
Reviewer: ahf, gk Sponsor:

Description

#33775 shows that if sbws calculates a low consensus bandwidth because of missing descriptors, and the rest of the bwauths are also using sbws, it would enter a spiral in which it keeps measuring low.
I think this is because torflow multiplies the calculated ratio by the descriptor observed bandwidth [0], while sbws multiplies the ratio by the minimum of all descriptor bandwidth values *and* the consensus bandwidth, which was added in #28598.
So maybe the new consensus bandwidth should not depend on the previous one, or at least not as the minimum.
For how the bandwidth values depend on each other, see [1]

[0] https://onbasca.readthedocs.io/en/latest/torflow_aggr.html
[1] https://onbasca.readthedocs.io/en/latest/bandwidth_tor.html#bandwidth-values-origin
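
A minimal sketch of the difference (the function and variable names are illustrative, not actual sbws or Torflow identifiers; the formulas are as summarized above and in [0]):

def torflow_scaled_bw(ratio, desc_obs_bw):
    # Torflow anchors the scaled value to the descriptor observed
    # bandwidth only, so a relay's weight can grow past its current
    # consensus value.
    return round(ratio * desc_obs_bw)

def sbws_scaled_bw(ratio, desc_obs_bw, desc_avg_bw, desc_burst_bw,
                   consensus_bw):
    # sbws (since #28598) takes the minimum of the descriptor values
    # *and* the previous consensus bandwidth, so a relay whose
    # consensus weight is already low can never be scaled back up.
    return round(ratio * min(desc_obs_bw, desc_avg_bw,
                             desc_burst_bw, consensus_bw))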

Child Tickets

Change History (24)

comment:1 Changed 6 months ago by arma

Agreed: it seems that anchoring the new number to the current consensus number doesn't let things grow.

That said, this one is still a bit confusing/weird. How come Torflow's number can grow a lot but sbws's can't? Is it only because the underlying descriptor says a larger number? But that only happens if the relay gets actual traffic. And how does it do that when it has a tiny consensus weight? Which...oh, maybe it gets actual traffic *because* of the torflow tests. So by testing it, it becomes bigger, and then our ratio makes it become bigger still.

comment:2 Changed 6 months ago by teor

There's a related bug here that might only show up when the last Torflow instance leaves the network.

The target download time is 6 seconds. Tor's download measurement time is 10 seconds. So on average, sbws will only be using 6/10 of a relay's capacity over any 10 second period.

I suggest that we make the average sbws download time at least 10 seconds, to avoid this issue.
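
For concreteness (illustrative arithmetic only, using the numbers above):

target_download_secs = 6        # sbws target download time
measurement_window_secs = 10    # tor's self-measurement window
capacity_fraction = target_download_secs / measurement_window_secs
# = 0.6: even a download that saturates the relay for the full 6 s
# shows up as only 60% of capacity in tor's 10 s average.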

comment:3 Changed 6 months ago by teor

(Relays could also do their own bandwidth self-tests, but those aren't large enough right now. And we probably shouldn't rely on them.)

comment:4 in reply to:  2 Changed 6 months ago by juga

Replying to teor:

There's a related bug here that might only show up when the last Torflow instance leaves the network.

The target download time is 6 seconds. Tor's download measurement time is 10 seconds. So on average, sbws will only be using 6/10 of a relay's capacity over any 10 second period.

I suggest that we make the average sbws download time at least 10 seconds, to avoid this issue.

This is similar to ticket #28545 (with 11 seconds).

If I understood correctly how Torflow calculates the new consensus bandwidth [0] (it might be worth someone double-checking), the measured bandwidth has very "little" influence on the new bandwidth (which I've tried to say a few times).
So I think that changing the download time has less impact on the newly calculated bandwidth than the previous consensus/descriptor bandwidth values do.

[0] https://onbasca.readthedocs.io/en/latest/torflow_aggr.html, https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt#n1167

comment:5 in reply to:  description ; Changed 6 months ago by starlight

Replying to juga:

I think this is because torflow multiplies the calculated ratio by the descriptor observed bandwidth [0], while sbws multiplies the ratio by the minimum of all descriptor bandwidth values *and* the consensus bandwidth, which was added in #28598.

flat wrong, see ticket:28588#comment:4

So maybe the new consensus bandwidth should not depend on the previous one, or at least not as the minimum.

The problems with SBWS are due to poor quality absolute bandwidth measurement.

comment:6 Changed 6 months ago by juga

gk and I thought that, as a first step, we'll implement this patch and deploy it on maatuska.
We'll then think more about how to check this before the majority of bwauths switch to sbws.

comment:7 in reply to:  5 Changed 6 months ago by juga

Replying to starlight:

Replying to juga:

I think this is because torflow multiplies the calculated ratio by the descriptor observed bandwidth [0], while sbws multiplies the ratio by the minimum of all descriptor bandwidth values *and* the consensus bandwidth, which was added in #28598.

flat wrong, see ticket:28588#comment:4

I don't know what you mean by "flat wrong". That comment was about including the "observed bandwidth" from the relay descriptors.

So maybe the new consensus bandwidth should not depend on the previous one, or at least not as the minimum.

The problems with SBWS are due to poor quality absolute bandwidth measurement.

I don't know what you mean by "poor quality" and "absolute bandwidth".

comment:8 Changed 6 months ago by juga

Owner: set to juga
Status: new → assigned

comment:9 Changed 5 months ago by juga

Cc: mike added

I'm CCing mike, because I'd love for him to check that my interpretation of what Torflow does is correct, so that we can deploy sbws on all bwauths.

Initially, I thought this ticket would be easy to solve by just taking the maximum of the descriptor bandwidth and the consensus bandwidth instead of the minimum, or by not considering the consensus bandwidth at all.

Then I realized that would be a big error, since:

  1. that would make the new consensus bandwidth depend basically on what the relay claims to have
  2. at some point we might remove the initial bandwidth self-test from tor (#22453)

So I went through the Torflow code again and realized the following:

  1. Filtered bandwidth

While my understanding of how it is calculated in Torflow is correct, the implementation was not.
sbws does not calculate the filtered bandwidth for *each* relay, but only one for all the relays together. I'll submit a patch for this. This does not affect the scaling much, though.
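
A sketch of the per-relay filtered bandwidth as I understand it from [0] (a hypothetical helper, not the sbws or Torflow code):

from statistics import mean

def relay_filtered_bw(stream_bws):
    # "Filtered" bandwidth for one relay: the mean of this relay's
    # stream bandwidths that are at or above the relay's own mean.
    avg = mean(stream_bws)
    return mean(bw for bw in stream_bws if bw >= avg)

The bug is that sbws computed a single filtered value over all relays instead of one of these per relay.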

  2. Descriptor bandwidth

TorCtl takes as observed bandwidth the minimum of the observed and burst descriptor bandwidths [6].
I observed that if the burst bandwidth is 0, then the descriptor bandwidth will be 1, regardless of whether there was observed bandwidth.
sbws has taken as observed bandwidth the minimum of all descriptor bandwidths since some version, except that it is currently discarding any relay without a descriptor average, not only relays without an observed bandwidth. I'll submit a patch for this.

sbws has not been keeping the descriptor or consensus bandwidth updated because of other bugs, which should be fixed now (to be checked in #33947).

However, what should happen when sbws is missing one of the descriptor observed, average or burst bandwidths?
Without the observed bandwidth, it doesn't make sense to take the minimum of the other two, because they're values set by the operator and are usually bigger.
Would it be fine to take only the observed bandwidth when the average and/or burst bandwidths are missing (assuming it's not an sbws bug)? Or, if the operator set either the average or the burst bandwidth to 0, should we still consider the observed bandwidth to be 1?
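
A sketch of the first option (a hypothetical policy, not sbws code):

def descriptor_bw(obs, avg, burst):
    # Without the observed bandwidth, min(avg, burst) is meaningless:
    # those are operator-set values and usually bigger.
    if obs is None:
        return None
    # Otherwise take the minimum with whichever operator-set values
    # are actually present.
    return min(v for v in (obs, avg, burst) if v is not None)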

  3. Descriptor bandwidth vs consensus bandwidth:

In #28598 we started to take the minimum of the descriptor bandwidth and the consensus bandwidth.

According to Torflow README.spec.txt [2]:

For the case where K_p = 1, K_i=0, and K_d=0, it can be seen that this
system is equivalent to the one defined in 2.2, except using consensus
bandwidth instead of descriptor bandwidth

In the current Torflow code, the K variables have those values, but it's using the descriptor, not the consensus.
In aggregate.py, bwauth_pid_control and use_desc_bw are initialized to True [3], so Node is also initialized to use the descriptor bandwidth and not the consensus bandwidth [4]. I'm quite sure that in any other if-else where use_bw might be something else, the default initialization means that code is never executed.
However, Torflow does not seem to scale at all if any consensus bandwidth is missing. [8]

sbws takes only the descriptor bandwidth if the consensus bandwidth is missing.
But what if the descriptor bandwidth is missing and the consensus bandwidth is present? Should we take only the consensus bandwidth?
Assuming it's not an sbws bug, is it possible to have only a consensus bandwidth without a descriptor bandwidth?
Should we scale in any case taking the minimum of the observed bandwidth and the consensus bandwidth, which would be 1 if either of them is missing?

Since we don't have an easy/fast way to simulate what happens when we replace all the Torflow instances with sbws, we'd need to human-monitor the consensus anyway once half of the bwauths run sbws again.

[2] https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/README.spec.txt#n390
[3] https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n237
[4] https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n589
[6] https://gitweb.torproject.org/pytorctl.git/tree/TorCtl.py#n460, https://gitweb.torproject.org/pytorctl.git/tree/TorCtl.py#n488
[8] https://gitweb.torproject.org/torflow.git/tree/NetworkScanners/BwAuthority/aggregate.py#n402

comment:10 Changed 5 months ago by arma

Cc: mikeperry added; mike removed

cc'ing the real mike

comment:11 Changed 5 months ago by mikeperry

Juga: Your analysis in #1 is correct to my memory, and I agree it's probably not causing this issue.

Re #2 and #3: what is the sbws torrc? TorFlow sets all the aggressive consensus and descriptor fetching options, so it may even have descriptors *not* in the consensus, or descriptors for relays oscillating in the consensus.

Torflow *also* uses Tor 0.2.9, which may have different descriptor vs microdescriptor fetching behavior. If you are using microdescriptors, you might also not be getting fresh enough relay data, because those have to be generated by at least one consensus process.

I think these are more likely culprits for this specific ticket.

Here is the relevant torrc piece, which I think may behave differently with Tor 0.2.9:

FetchUselessDescriptors 1
# Workaround for Tor #24110, tracked in TorFlow #24094
UseMicrodescriptors 0
__LeaveStreamsUnattached 1

# Bad idea? Too much consensus update activity?
FetchDirInfoExtraEarly 1
FetchDirInfoEarly 1

comment:12 in reply to:  11 Changed 5 months ago by juga

Replying to mikeperry:

Juga: Your analysis in #1 is correct to my memory, and I agree it's probably not causing this issue.

Re #2 and #3: what is the sbws torrc? TorFlow sets all the aggressive consensus and descriptor fetching options, so it may even have descriptors *not* in the consensus, or descriptors for relays oscillating in the consensus.

Torflow *also* uses Tor 0.2.9, which may have different descriptor vs microdescriptor fetching behavior. If you are using microdescriptors, you might also not be getting fresh enough relay data, because those have to be generated by at least one consensus process.

I think these are more likely culprits for this specific ticket.

Here is the relevant torrc piece, which I think may behave differently with Tor 0.2.9:

FetchUselessDescriptors 1
# Workaround for Tor #24110, tracked in TorFlow #24094
UseMicrodescriptors 0
__LeaveStreamsUnattached 1

# Bad idea? Too much consensus update activity?
FetchDirInfoExtraEarly 1
FetchDirInfoEarly 1

I agree it's important to update descriptors and the consensus correctly, but this should have been fixed in #30733 and other tickets.

I think I didn't make my questions clear.
Let's assume consensus and descriptors are updated correctly, and the ratio is also calculated correctly.

To my understanding, what Torflow basically does is:

new_bw = min(descriptor observed bandwidth, descriptor burst bandwidth, 1) * ratio
Iff consensus bandwidth is not None (for all relays).

And what sbws is doing:

new_bw = min(consensus bandwidth, descriptor observed bandwidth, descriptor average bandwidth, descriptor burst bandwidth, 1) * ratio
Iff:

  • descriptor observed bandwidth > 0 and descriptor observed bandwidth != None (for that relay) and
  • descriptor average bandwidth > 0 and descriptor average bandwidth != None (for that relay) and
  • descriptor burst bandwidth != None

I think that treating the average bandwidth and the burst bandwidth differently is a bug, and I think that if consensus AND observed are missing, we should not calculate the minimum with only the average and burst, because those numbers could be high and are set by the operator.

So, the main question is: should we calculate the new_bw (scale) for a relay when some of the bandwidth values are missing or 0 (which would result in 1 for the min)? If not, which bandwidth values should we have at least?

I'd answer (but I'm not sure; see the sketch below):

  • Iff consensus AND observed are missing: do not scale, i.e., don't include the relay in the bandwidth file to vote on
  • scale in any other case with the values we have (and convert None values to 0? If there are None values, there is still a bug in sbws we need to solve)
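
A sketch of that answer (hypothetical, not sbws code; it treats None as absent rather than converting it to 0):

def scaling_base(consensus_bw, obs_bw, avg_bw, burst_bw):
    # Don't scale: the relay is not included in the bandwidth file.
    if consensus_bw is None and obs_bw is None:
        return None
    # Otherwise scale with whatever values we have.
    return min(v for v in (consensus_bw, obs_bw, avg_bw, burst_bw)
               if v is not None)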

comment:13 in reply to:  11 Changed 5 months ago by teor

I agree with what Juga said, but I'm leaving these notes here, in case you need them in future.

Replying to mikeperry:

Torflow *also* uses Tor 0.2.9, which may have different descriptor vs microdescriptor fetching behavior.

Later tor versions might also have different descriptor caching behaviour. And there could also be bugs in how sbws gets descriptors from tor, or how it handles relays throughout the process.

But if it turns out that there are still descriptor issues, we can diagnose them by comparing the descriptors at each stage.

Unfortunately, we lose some descriptors at each stage, so the analysis won't be easy.

I think these are more likely culprits for this specific ticket.

Here is the relevant torrc piece, which I think may behave differently with Tor 0.2.9:

FetchUselessDescriptors 1
# Workaround for Tor #24110, tracked in TorFlow #24094
UseMicrodescriptors 0
__LeaveStreamsUnattached 1

# Bad idea? Too much consensus update activity?
FetchDirInfoExtraEarly 1
FetchDirInfoEarly 1

As juga said, sbws sets similar options:
https://github.com/torproject/sbws/blob/master/sbws/globals.py#L20

For 0.4.0 and later, sbws might also want to set:

DormantClientTimeout 7 days
DormantCanceledByStartup 1

So that a network outage or temporary sbws hang won't cause tor to stop downloading new descriptors.

But that's unlikely to be the source of your problem, because the default DormantClientTimeout is 24 hours, which is a long network outage.

comment:14 Changed 5 months ago by mikeperry

Oh, I missed that sbws was including avg_bandwidth in the min(). That definitely should not be there. Average will always be below observed, and we want observed or burst, because burst can be limited by config by operators who don't want traffic, giving them at least one (admittedly sloppy) knob to try to reduce traffic.

I just dug a bit deeper into the 0 bw issue, and I found that we actually set it to at least 1, to simplify computation. This is done in the TorCtl.Router class: https://gitweb.torproject.org/pytorctl.git/tree/TorCtl.py#n379

So we should probably do something like that in sbws too (either hack it to 1 or riddle the code with special cases everywhere), so we can still measure 0-bw relays.
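
A sketch of the "hack to 1" option (assuming sbws mirrors TorCtl.Router's behavior):

def clamp_bw(bw):
    # Treat a missing or 0 bandwidth as 1, as TorCtl.Router does, so
    # later ratios stay well-defined and 0-bw relays remain measurable.
    return max(bw or 0, 1)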

FYI: There is a related hack for Unmeasured=1 relays here: https://trac.torproject.org/projects/tor/ticket/6131

comment:15 in reply to:  14 ; Changed 5 months ago by juga

Replying to mikeperry:

Oh, I missed that sbws was including avg_bandwidth in the min(). That definitely should not be there. Average will always be below observed,

I don't think this is the case; just taking a random relay from my cached-descriptors:

router rotateRelay21 188.227.170.106 2109 0 0
fingerprint 1387 BB67 C0A1 BCE2 8133 FF72 39E0 27D5 4C56 538D
bandwidth 2621440 10485760 5445375

I also looked at the spec and the code 2 years ago (https://onbasca.readthedocs.io/en/latest/bandwidth_tor.html) and concluded that the average is min(RelayBandwidthRate, BandwidthRate, MaxAdvertisedBandwidth), which are all set by the operator (or torrc defaults); am I mistaken?
Maybe you're thinking of the advertised bandwidth, not the average?

and we want observed or burst because burst can be limited by config by operators who don't want traffic, giving them at least one (admittedly sloppy) knob to try to reduce traffic.

I agree with this, but for the same reason we should also limit it by the average (https://gitweb.torproject.org/sbws.git/tree/sbws/lib/v3bwfile.py#n1340)

I just dug a bit deeper into the 0 bw issue, and I found that we actually set it to at least 1, to simplify computation. This is done in the TorCtl.Router class: https://gitweb.torproject.org/pytorctl.git/tree/TorCtl.py#n379

So we should probably do something like that in sbws too (either hack it to 1 or riddle the code with special cases everywhere), so we can still measure 0-bw relays.

Yes, we're doing that in sbws; note that in my previous reply I included a 1 in the new_bw equations, i.e. new_bw = min(consensus bandwidth, descriptor observed bandwidth, descriptor average bandwidth, descriptor burst bandwidth, 1) * ratio

What I mean is that it's a different thing to calculate the minimum when one of those values is 0 (the result will be 1) than to exclude one of those values when it's 0 (the result will be greater than or equal to 1).

Note that AFAIU the Torflow code is not calculating the new_bw using the consensus bandwidth, though; it just does not produce any bandwidth file if any consensus bandwidth is missing.

comment:16 in reply to:  15 Changed 5 months ago by juga

Replying to juga:
[snip]

I just dug a bit deeper into the 0 bw issue, and I found that we actually set it to at least 1, to simplify computation. This is done in the TorCtl.Router class: https://gitweb.torproject.org/pytorctl.git/tree/TorCtl.py#n379

So we should probably do something like that in sbws too (either hack it to 1 or riddle the code with special cases everywhere), so we can still measure 0-bw relays.

Yes, we're doing that in sbws; note that in my previous reply I included a 1 in the new_bw equations, i.e. new_bw = min(consensus bandwidth, descriptor observed bandwidth, descriptor average bandwidth, descriptor burst bandwidth, 1) * ratio

Oh, sorry, I wanted to put it all in 1 line, but this would always be 1 :)
So it should be the maximum between each bandwidth value and 1, and then the minimum of those.
I realized that we're actually taking the maximum of the new bandwidth and 1 at the very end, which is also different from taking the maximum of each bandwidth value and 1 before multiplying by the ratio. We should probably do it in this last way, as torflow does (see the numeric example after the quoted text below).

What I mean is that it's a different thing to calculate the minimum when one of those values is 0 (the result will be 1) than to exclude one of those values when it's 0 (the result will be greater than or equal to 1).

Note that AFAIU the Torflow code is not calculating the new_bw using the consensus bandwidth, though; it just does not produce any bandwidth file if any consensus bandwidth is missing.
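
To make the difference concrete (illustrative numbers only):

ratio = 2.5
values = (0, 3000, 5000)  # e.g. burst is 0, observed and average are set

# Clamping each value to at least 1 *before* the min, as Torflow
# effectively does (TorCtl already stored at least 1 per value):
torflow_like = min(max(v, 1) for v in values) * ratio  # min(1, 3000, 5000) * 2.5 = 2.5

# Clamping only the final result to 1, as sbws was doing:
sbws_like = max(min(values) * ratio, 1)                # max(0 * 2.5, 1) = 1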

comment:17 in reply to:  15 ; Changed 5 months ago by mikeperry

Replying to juga:

Replying to mikeperry:

Oh, I missed that sbws was including avg_bandwidth in the min(). That definitely should not be there. Average will always be below observed,

I don't think this is the case; just taking a random relay from my cached-descriptors:

router rotateRelay21 188.227.170.106 2109 0 0
fingerprint 1387 BB67 C0A1 BCE2 8133 FF72 39E0 27D5 4C56 538D
bandwidth 2621440 10485760 5445375

I also looked at the spec and the code 2 years ago (https://onbasca.readthedocs.io/en/latest/bandwidth_tor.html) and concluded that the average is min(RelayBandwidthRate, BandwidthRate, MaxAdvertisedBandwidth), which are all set by the operator (or torrc defaults); am I mistaken?
Maybe you're thinking of the advertised bandwidth, not the average?

Hrm it appears TorFlow is actually taking the minimum of all three here: https://gitweb.torproject.org/pytorctl.git/tree/TorCtl.py#n459

So this should match sbws taking the min of all three.

Sorry I did not recall that either. It has been 8 years since I worked on this code.

If it is not the order of composition of the min() you mention, then I am still not sure what the issue is :/.

comment:18 Changed 5 months ago by juga

Reviewer: ahf, gk
Status: assigned → needs_review

comment:19 in reply to:  17 Changed 5 months ago by juga

Replying to mikeperry:
[snip]

Hrm it appears TorFlow is actually taking the minimum of all three here: https://gitweb.torproject.org/pytorctl.git/tree/TorCtl.py#n459

So this should match sbws taking the min of all three.

Sorry I did not recall that either. It has been 8 years since I worked on this code.

Yeah, I understand :)
Hopefully this documentation helps clarify what Torflow does: https://gitlab.torproject.org/juga/sbws/-/blob/maint-1.1_bug33871_scaling/docs/source/torflow_aggr.rst

comment:20 Changed 5 months ago by juga

Note for the reviewer:
In the end I'm taking into account the consensus bandwidth when there is no observed bandwidth, but I'm still not sure that's what should be done, because Torflow doesn't do it.
I'm also not sure whether to take the tor process's observed bandwidth instead of the observed bandwidth that was stored at the moment of making the measurement. Torflow does not do this, but it measures relays faster and more often.
I've observed (see https://trac.torproject.org/projects/tor/ticket/33947#comment:2) that the relays that have big differences with Torflow might be due to missing descriptor/consensus bandwidth.

comment:21 in reply to:  18 Changed 5 months ago by gk

Status: needs_review → needs_revision

Replying to juga:

https://gitlab.torproject.org/torproject/network-health/sbws/-/merge_requests/7

While thinking over your comment:20 I've given the branch a first pass and it looks mostly good to me. There are a bunch of typos and smaller things you might want to correct. Thus, putting this into needs_revision.

comment:22 Changed 5 months ago by juga

Status: needs_revision → needs_review

Thanks, I think I addressed your comments.

comment:23 in reply to:  22 Changed 5 months ago by gk

Status: needs_review → needs_revision

Replying to juga:

Thanks, I think I addressed your comments.

Almost. One is left. :)

comment:24 Changed 5 months ago by juga

Status: needs_revision → needs_review

Should be fine now :)
