Opened 4 years ago

Last modified 11 months ago

#16255 needs_revision defect

Guardfraction on dirauths screws up bandwidth weights

Reported by: asn Owned by: asn
Priority: High Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-dirauth, tor-guard, mike-can, SponsorU-deferred
Cc: Actual Points:
Parent ID: #24456 Points: medium/large
Reviewer: Sponsor:

Description

It seems that dirauths stopped including bandwidth weights in the consensus after GuardFraction got enabled:
https://lists.torproject.org/pipermail/tor-dev/2015-June/008908.html

Looking at weasel's dirauth we get plenty of related error messages:

[warn] Bw Weights error 1 for Case 3be (E scarce, Wee=1, Wmd == Wgd) v10. G=14793673 M=8310679 E=3428814 D=7040272 T=26698303 Wmd=-512 Wme=0 Wmg=2192 Wed=11025 Wee=10000 Wgd=-512 Wgg=7808 Wme=0 Wmg=2192 weight_scale=10000

That is, it seems the dirauths' bandwidth-weight calculation got screwed up a bit. This might be the result of using the guardfraction data during bandwidth calculation in compute_weighted_bandwidths(), which might combine badly with #13297.

The negative Wmd/Wgd values seem weird here as well.
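As a sanity check, the totals in the warning above should satisfy T = G + M + E + D, but they don't: G + M + E + D = 33573438 while T = 26698303. A minimal sketch of that check (hypothetical helper, not Tor code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch (not Tor code): the total weight T used for
 * bandwidth-weight computation should equal the sum of the per-position
 * totals G (guard), M (middle), E (exit), and D (guard+exit). */
static int
totals_consistent(uint64_t G, uint64_t M, uint64_t E, uint64_t D, uint64_t T)
{
  return G + M + E + D == T;
}
```

Plugging in the values from the log line (G=14793673, M=8310679, E=3428814, D=7040272, T=26698303) makes this check fail, so the inputs to the weight equations are already inconsistent before any weights are derived.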

Let's disable the GuardFraction feature for now, until we figure out this issue.

Child Tickets

Change History (37)

comment:1 Changed 4 years ago by arma

Ok, I disabled GuardfractionFile on moria1 for now.

comment:2 Changed 4 years ago by asn

OK, I found a silly bug that might be causing this problem.

When we were calculating the total weights (T, G, M, E, D), we applied guardfraction information to the D/E and G/M weights, but we didn't increase the total weight T by the right amount.

So for example, if a router with bandwidth 1000 is 75% guard, it has 750 bandwidth as a guard (which should be assigned to G) and 250 bandwidth as a non-guard (which should be assigned to M). The issue is that we only incremented T by 750, whereas we should have incremented it by 750 + 250.

This led to T being much smaller than the sum G + M + E + D, which shouldn't happen IIUC.
This is probably why we ended up with negative weights in the end.
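The accounting error described above can be sketched as follows (a hypothetical function, not the actual compute_weighted_bandwidths() code): a guardfraction relay's bandwidth is split between G and M, and T must receive both parts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the fixed accounting: a relay with bandwidth
 * `bw` and guardfraction percentage `pct` (0-100) contributes
 * bw*pct/100 to the guard total G and the remainder to the middle
 * total M.  The bug was that T was only incremented by the guard
 * portion; the fix adds both portions, keeping T == G + M + E + D. */
static void
accumulate_guardfraction(uint64_t bw, int pct,
                         uint64_t *G, uint64_t *M, uint64_t *T)
{
  uint64_t guard_bw = bw * (uint64_t)pct / 100;
  uint64_t nonguard_bw = bw - guard_bw;
  *G += guard_bw;
  *M += nonguard_bw;
  *T += guard_bw + nonguard_bw;  /* buggy version added only guard_bw */
}
```

With the example above (bw=1000, pct=75), G gains 750, M gains 250, and T gains the full 1000.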

---

The above is an easy issue to fix. A harder issue is that clients with guardfraction disabled will try to validate the total weights in networkstatus_verify_bw_weights(). Since they don't understand guardfraction, they will end up with different values than the authorities and will then complain with warnings like:

Jun 02 17:47:59.000 [warn] Bw Weight Failure for Case 3b: Etotal 10050.492000 != Mtotal 9436.164000. G=12930 M=0 E=0 D=21480 T=34410. Wgg=1.000000 Wgd=0.092800 Wmg=0.000000 Wme=0.000000 Wmd=0.439300 Wee=1.000000 Wed=0.467900

Also, because of #13297 the authorities themselves will also complain after parsing their own consensus.

Need to think more how to correctly deploy this.

comment:3 in reply to:  2 Changed 3 years ago by asn

Replying to asn:

OK, I found a silly bug that might be causing this problem.

When we were calculating the total weights (T, G, M, E, D), we applied guardfraction information to the D/E and G/M weights, but we didn't increase the total weight T by the right amount.

So for example, if a router with bandwidth 1000 is 75% guard, it has 750 bandwidth as a guard (which should be assigned to G) and 250 bandwidth as a non-guard (which should be assigned to M). The issue is that we only incremented T by 750, whereas we should have incremented it by 750 + 250.

This led to T being much smaller than the sum G + M + E + D, which shouldn't happen IIUC.
This is probably why we ended up with negative weights in the end.

---

The above is an easy issue to fix. A harder issue is that clients with guardfraction disabled will try to validate the total weights in networkstatus_verify_bw_weights(). Since they don't understand guardfraction, they will end up with different values than the authorities and will then complain with warnings like:

Jun 02 17:47:59.000 [warn] Bw Weight Failure for Case 3b: Etotal 10050.492000 != Mtotal 9436.164000. G=12930 M=0 E=0 D=21480 T=34410. Wgg=1.000000 Wgd=0.092800 Wmg=0.000000 Wme=0.000000 Wmd=0.439300 Wee=1.000000 Wed=0.467900

Hm, that's not entirely true. Only authorities validate the bw weights in networkstatus_verify_bw_weights(). Hence we just need to introduce guardfraction logic to that function. Ideally, we would use update_total_bandwidth_weights() there as well.

comment:4 Changed 3 years ago by isabela

Keywords: https://trac.torproject.org/projects/tor/query?status=!closed&keywords=~TorCoreTeam201507 added

comment:5 Changed 3 years ago by nickm

Milestone: Tor: 0.2.6.x-final → Tor: 0.2.7.x-final

comment:6 Changed 3 years ago by nickm

Keywords: TorCoreTeam201507 added; https://trac.torproject.org/projects/tor/query?status=!closed&keywords=~TorCoreTeam201507 removed

comment:7 Changed 3 years ago by asn

Status: new → needs_review

OK, I now have a branch bug16255_027 which addresses the issues I found.

Specifically:

  • We used to mess up the calculation of total bandwidth T when guardfraction relays were considered. That's because we only updated T with the guard bandwidth, and ignored the non-guard bandwidth. This should be addressed in commit 79f6c49 and should fix the Bw Weight Failure errors that blocked consensus weight creation.
  • As authorities we used to throw warnings when verifying the bandwidths of our own consensus, because we did not first apply the guardfraction information. This commit fixes this by applying the guardfraction information when parsing consensuses as authorities.

We should probably get this applied to most of the authorities before turning the feature on again.

I also prepared a backport branch for 0.2.6 dirauths, bug16255_026, that only includes the first fix (which is very small) and can be used if we want to turn this feature on faster. Consensus weights should work with the backport branch, but the authorities will issue a warning when verifying their own consensus.

In general, it's a PITA testing bandwidth-related features with chutney because the bandwidth values in a test network are nothing compared to reality. Even without the above fixes, my chutney network will sometimes generate consensus weights, because the bandwidth values are so low that it doesn't even notice the T weight being off.

I'd appreciate a good review here, and also pointers to other places in the guardfraction design that could cause failures.

comment:8 Changed 3 years ago by nickm

Keywords: TorCoreTeam201508 added; TorCoreTeam201507 removed

comment:9 Changed 3 years ago by nickm

Taking a quick note, I need to double-check this, but:

Does this patch change how the input votes affect the output consensus? If so, it must use a new consensus-method.

comment:10 in reply to:  9 Changed 3 years ago by asn

Replying to nickm:

Taking a quick note, I need to double-check this, but:

Does this patch change how the input votes affect the output consensus? If so, it must use a new consensus-method.

This fix indeed changes how the input votes affect the output consensus. That's because authorities with this patch will calculate different total bandwidth weights and Wgg/Wgd/Wmg/etc. values than unpatched authorities.

Still, I decided to not add a new consensus method for the following reasons:

  • We already have a consensus method for guardfraction (20). I would have to add a second stronger one just for this fix which seemed ugly (but this should not stop us if it's the right thing to do).
  • If I added a new consensus method, I would also have to change the encoding format on the votes (GuardFraction=50, etc.). Otherwise, unpatched authorities that support the old consensus method (20) would still parse the GuardFraction values of the patched authorities, thinking that they understand them. That would be a mistake, since at that point unpatched and patched authorities would generate different consensuses. So if we add a new consensus method, we should probably also change the encoding format to something like GuardFractionVal=50 or Guardfraction=50, so that unpatched authorities do not understand it.

For the two reasons above, I decided (perhaps wrongly) to not add a new consensus method. Instead I provided both 026 and 027 patches and my plan was to ask the authorities to turn on the feature again only after a majority of the authorities have patched themselves.

Do you think that's a dangerous plan, and instead I should just make a new consensus method?

Thanks!

comment:11 Changed 3 years ago by nickm

I truly do think there should be a new consensus-method here.

After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.
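The gating nickm describes can be sketched as follows (hypothetical constant value and helper name; the actual negotiation logic lives in Tor's voting code). A feature keyed to a minimum consensus method only activates once the method negotiated by the authorities reaches that minimum, with no manual coordination needed.

```c
#include <assert.h>

/* Hypothetical sketch: 20 is the guardfraction consensus method
 * mentioned later in this ticket (comment 10).  The negotiated
 * consensus method is, roughly, the highest method supported by
 * enough authorities; features below that threshold stay off. */
#define MIN_METHOD_FOR_GUARDFRACTION 20

static int
guardfraction_enabled_for_method(int consensus_method)
{
  return consensus_method >= MIN_METHOD_FOR_GUARDFRACTION;
}
```

Until every voting authority supports method 20, the negotiated method stays below it and the predicate returns false everywhere, so no authority applies guardfraction to the shared consensus.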

comment:12 in reply to:  11 ; Changed 3 years ago by asn

Replying to nickm:

I truly do think there should be a new consensus-method here.

After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.

OK that makes sense then.

So, should I bump MIN_METHOD_FOR_GUARDFRACTION to 22?

And also, should I change the GuardFraction= encoding format to something else, so that unpatched authorities don't use it? To what? Do you prefer GuardFractionPercentage= or Guardfraction= or GuardFractionVal= or what?

comment:13 in reply to:  12 Changed 3 years ago by nickm

Replying to asn:

Replying to nickm:

I truly do think there should be a new consensus-method here.

After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.

OK that makes sense then.

So, should I bump MIN_METHOD_FOR_GUARDFRACTION to 22?

This is a little tricky. Yes, but you'll need to add a 22 for 0.2.6, and a 23 for 0.2.7. And the code in 0.2.7 that checks for MIN_METHOD_FOR_ED25519_ID_* needs to make sure that it isn't including 22.

And also, should I change the GuardFraction= encoding format to something else, so that unpatched authorities don't use it? To what? Do you prefer GuardFractionPercentage= or Guardfraction= or GuardFractionVal= or what?

Whatever you like. "GuardPct" would be one option.

comment:14 Changed 3 years ago by nickm

Status: needs_review → needs_revision

comment:15 Changed 3 years ago by nickm

Keywords: TorCoreTeam201509 PostFreeze027 added; TorCoreTeam201508 removed
Owner: set to asn
Priority: normal → major
Status: needs_revision → assigned

comment:16 Changed 3 years ago by nickm

Status: assigned → needs_revision

comment:17 Changed 3 years ago by nickm

Keywords: PostFreeze027 removed
Milestone: Tor: 0.2.7.x-final → Tor: 0.2.8.x-final

asn says this should be ready in mid-September, which makes it 0.2.8 material

comment:18 Changed 3 years ago by mikeperry

Keywords: 0.2.8.x-triage added

comment:19 Changed 3 years ago by mikeperry

Keywords: 028-triage added; 0.2.8.x-triage removed

comment:20 Changed 3 years ago by nickm

Keywords: 028-triaged added

comment:21 Changed 3 years ago by nickm

Points: medium/large

comment:22 Changed 3 years ago by mikeperry

Keywords: mike-can added

comment:23 Changed 3 years ago by nickm

Keywords: pre028-patch added

comment:24 Changed 3 years ago by asn

Milestone: Tor: 0.2.8.x-final → Tor: 0.2.9.x-final
Severity: Normal

I'm gonna postpone this for 0.2.9.

I don't have enough time to get my testing setup and test this properly.

I will also need some assistance from Mike or someone else to figure out whether my bug16255_026 branch is the only fix required here. Maybe we can do it in Valencia.

comment:25 Changed 3 years ago by isabela

Sponsor: SponsorU-can

comment:26 Changed 3 years ago by nickm

Keywords: TorCoreTeam201509 removed

Removing TorCoreTeam201509 from these tickets, since we do not own a time machine.

comment:27 Changed 3 years ago by isabela

Milestone: Tor: 0.2.9.x-final → Tor: 0.2.???

tickets marked to be removed from milestone 029

comment:28 Changed 3 years ago by nickm

Keywords: SponsorU-deferred added
Sponsor: SponsorU-can removed

Remove the SponsorU status from these items, which we already decided to defer from 0.2.9. Add the SponsorU-deferred tag instead, in case we ever want to remember which ones these were.

comment:29 Changed 2 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:30 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.???Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:31 Changed 19 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:32 Changed 19 months ago by nickm

Keywords: 028-triaged removed

comment:33 Changed 19 months ago by nickm

Keywords: 028-triage removed

comment:34 Changed 19 months ago by nickm

Keywords: pre028-patch removed

comment:35 Changed 18 months ago by nickm

Are we still interested in fixing this up, or should we wontfix it and rip out the guardfraction code?

comment:36 Changed 18 months ago by asn

I'm really tempted to suggest the latter option here, unless someone (who is not me) wants to take this on.

Fixing this up requires fighting with the bandwidth equations to ensure that the fix works and that no other bugs are lying around. I've had a bad time every time I tried to test the bandwidth equations, since their behavior differs between chutney and the real network, and bugs that appear on the real network might not appear on chutney (like this bug) unless you have a big and realistic enough chutney network (the testnet might be a better test case here, but it still does not contain enough relays to exercise all the bandwidth equation cases, etc.).

Fortunately, we are not considering extending the guard lifetime to 9 months anymore, so this is not as crucial as we thought in the past (but still worth fixing eventually).

comment:37 Changed 11 months ago by teor

Parent ID: #24456

If we remove this code, we don't need to fix it

Note: See TracTickets for help on using tickets.