Opened 13 months ago

Last modified 7 months ago

#24456 new defect

Figure out what to do with the guardfraction feature

Reported by: asn Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-dirauth, tor-guard, review-group-32, review-group-33, review-group-34, 033-triage-20180320, 033-included-20180320
Cc: Actual Points:
Parent ID: Points: 2
Reviewer: mikeperry Sponsor:

Description (last modified by arma)

Guardfraction code is still around the codebase: guard_get_guardfraction_bandwidth(), should_apply_guardfraction(), routerstatus_parse_guardfraction(), guardfraction_line_apply(), guardfraction_file_parse_guard_line(), dirserv_read_guardfraction_file_from_str(), compute_weighted_bandwidths(). Also AFAIK some dirauths are still running the guardfraction python script.

Unfortunately, guardfraction code is still broken because of #16255 and possibly other unknown bugs. Hence, guardfraction support is disabled in the dirauths and all that code is currently useless.

My current plan here is to remove the guardfraction code from the Tor codebase, since fixing it and maintaining it as a PITA that I don't want to sign up for and we probably don't need it as much as we thought we did (see comment:36:ticket:16255).

Child Tickets

TicketStatusOwnerSummaryComponent
#16255needs_revisionasnGuardfraction on dirauths screws up bandwidth weightsCore Tor/Tor
#25224newCreate a new consensus method that ignores guardfraction votesCore Tor/Tor

Change History (31)

comment:1 Changed 13 months ago by asn

Description: modified (diff)

comment:2 Changed 11 months ago by asn

Status: newneeds_review

OK I did it! I ripped off the code from the codebase. Here are the rewards:

19 files changed, 11 insertions(+), 1063 deletions(-)

Please see branch ticket24456 in my repo for this. Tests pass and make test-network-all pass. The diff is pretty simple.

Rationale for the removal is cited in the commit msg.

comment:3 in reply to:  2 ; Changed 11 months ago by teor

Status: needs_reviewneeds_revision

Replying to asn:

OK I did it! I ripped off the code from the codebase. Here are the rewards:

19 files changed, 11 insertions(+), 1063 deletions(-)

Please see branch ticket24456 in my repo for this.

Quick structural review:

We don't delete options, we OBSOLETE() them.

We can't just delete code from networkstatus_compute_consensus() and any the functions it calls. Instead, we add a new consensus method, and only run the old code when the consensus method is old enough. For guardfraction, there will be a range of consensus methods that run the guardfraction code.

Votes are different: we can make authorities stop voting guardfraction without needing a new consensus method.

How buggy is guardfraction?
Does it work?
Does it cause consensus failures when it's turned on?
Should we backport the "don't vote guardfraction" part of this patch to 0.3.1 and 0.3.2?
(The minimal patch obsoletes the options, but leaves them in or_options_t. This works because zero means off.)

Tests pass and make test-network-all pass. The diff is pretty simple.

This needs to pass the "mixed" chutney test, which is only run if you have a binary called "tor-stable" in your path.
I usually make this happen by doing "ln -s /usr/bin/tor /usr/local/bin/tor-stable".

Also, the mixed network needs to pass with guardfraction enabled on the stable authorities.

Rationale for the removal is cited in the commit msg.

comment:4 in reply to:  3 ; Changed 10 months ago by asn

Replying to teor:

Replying to asn:

OK I did it! I ripped off the code from the codebase. Here are the rewards:

19 files changed, 11 insertions(+), 1063 deletions(-)

Please see branch ticket24456 in my repo for this.

Quick structural review:

We don't delete options, we OBSOLETE() them.

We can't just delete code from networkstatus_compute_consensus() and any the functions it calls. Instead, we add a new consensus method, and only run the old code when the consensus method is old enough. For guardfraction, there will be a range of consensus methods that run the guardfraction code.

Oh man you are right. I guess we need to do this even tho no dirauths run the guardfraction code right now and they don't plan to start... So I'd need to define a new consensus method MIN_METHOD_FOR_REMOVING_GUARDFRACTION and only run the guardfraction consensus code if it's between MIN_METHOD_FOR_GUARDFRACTION and MIN_METHOD_FOR_REMOVING_GUARDFRACTION.

And then eventually when all dirauths are upgraded to versions >= MIN_METHOD_FOR_REMOVING_GUARDFRACTION we can remove that consensus guardfraction code too. Does this plan make sense?

Votes are different: we can make authorities stop voting guardfraction without needing a new consensus method.

How buggy is guardfraction?

It suffers from #16255 and perhaps other unknown bugs. It's extremely hard to test this feature on chutney (because chutney's bandwidth weights are not realistic), so we only learned of #16255 when we deployed it on the real net.

Does it work?

It "works" but it's buggy. Authorities compute bad consensus weights for the relays and
then the bandwidth equations complain.

Does it cause consensus failures when it's turned on?

Yes.

Should we backport the "don't vote guardfraction" part of this patch to 0.3.1 and 0.3.2?

Not sure. Perhaps not. No dirauths have the guardfraction torrc option enabled anyway.

comment:5 in reply to:  4 Changed 10 months ago by teor

Replying to asn:

Replying to teor:

Replying to asn:

OK I did it! I ripped off the code from the codebase. Here are the rewards:

19 files changed, 11 insertions(+), 1063 deletions(-)

Please see branch ticket24456 in my repo for this.

Quick structural review:

We don't delete options, we OBSOLETE() them.

We can't just delete code from networkstatus_compute_consensus() and any the functions it calls. Instead, we add a new consensus method, and only run the old code when the consensus method is old enough. For guardfraction, there will be a range of consensus methods that run the guardfraction code.

Oh man you are right. I guess we need to do this even tho no dirauths run the guardfraction code right now and they don't plan to start... So I'd need to define a new consensus method MIN_METHOD_FOR_REMOVING_GUARDFRACTION and only run the guardfraction consensus code if it's between MIN_METHOD_FOR_GUARDFRACTION and MIN_METHOD_FOR_REMOVING_GUARDFRACTION.

It's >= MIN_METHOD_FOR_GUARDFRACTION and < MIN_METHOD_FOR_REMOVING_GUARDFRACTION.

And then eventually when all dirauths are upgraded to versions >= MIN_METHOD_FOR_REMOVING_GUARDFRACTION we can remove that consensus guardfraction code too. Does this plan make sense?

Technically, we can't remove the guardfraction code until we remove all the consensus methods that could have used it.
At the moment, we support all methods from 13-28, excluding 21.

We can't remove all the buggy guardfraction methods like we removed 21, because the buggy range is 20-28.
(Maybe we could remove 20, because we'll never use it.)

See also #24378, where I suggest we start removing some of the lower methods. The last time we removed consensus methods was in #10163.

Votes are different: we can make authorities stop voting guardfraction without needing a new consensus method.

How buggy is guardfraction?

It suffers from #16255 and perhaps other unknown bugs. It's extremely hard to test this feature on chutney (because chutney's bandwidth weights are not realistic), so we only learned of #16255 when we deployed it on the real net.

Does it work?

It "works" but it's buggy. Authorities compute bad consensus weights for the relays and
then the bandwidth equations complain.

Does it cause consensus failures when it's turned on?

Yes.

Ok, so let's rip out the vote generation code in master as soon as we obsolete the option.

Should we backport the "don't vote guardfraction" part of this patch to 0.3.1 and 0.3.2?

Not sure. Perhaps not. No dirauths have the guardfraction torrc option enabled anyway.

Let's backport making the option obsolete to 0.2.9 and later.
That's a one-line patch.

comment:6 Changed 10 months ago by asn

OK let's recap here because these backward compatibility stuff confuse me!

For 033 master we need a patch that obsoletes the config options (also removes all the config option-relevant code), and also kills the guardfraction voting feature. See my branch bug24456_033 for this.

For 029 and later, we just need a patch that obsoletes the config options. See my branch bug24456_029 for this.

Then later in the future, when most dirauths use tor versions with obsolete config options, we can completely kill the rest of the code (vote/consensus guardfraction parsing, consensus voting code, client code, etc.).

Does that make sense? If it does, I'll add changes file etc to the branches above, and also make further backport branches.

Thanks for the help Tim :]

Last edited 10 months ago by asn (previous) (diff)

comment:7 Changed 10 months ago by asn

Status: needs_revisionneeds_review

comment:8 in reply to:  6 ; Changed 10 months ago by teor

Replying to asn:

OK let's recap here because these backward compatibility stuff confuse me!

For 033 master we need a patch that obsoletes the config options (also removes all the config option-relevant code), and also kills the guardfraction voting feature. See my branch bug24456_033 for this.

This is ok. Refusing to vote guardfraction is compatible with any consensus method.

For 029 and later, we just need a patch that obsoletes the config options. See my branch bug24456_029 for this.

This is ok. Refusing to vote guardfraction is compatible with any consensus method.

Then later in the future, when most dirauths use tor versions with obsolete config options, we can completely kill the rest of the code (vote/consensus guardfraction parsing, consensus voting code, client code, etc.).

One minor tweak:

Clients can ignore guardfraction in 033 or 034 if we want. Clients do not have to use consensus features.

And here is how we remove code that depends on consensus method 20 (the guardfraction method):

  1. Create a new consensus method that ignores guardfraction votes. This new method can go in 033 or 034.
  2. Change all the code that says ">= MIN METHOD FOR GUARDFRACTION" to say ">= MIN METHOD FOR GUARDFRACTION and < MIN METHOD TO IGNORE GUARDFRACTION" (the new method). This disables guardfraction when the new method or any later method is used. See consensus method 28 for an example of the code and spec that we need to remove a consensus feature.
  3. When we will never revert to a lower consensus method (two releases later, 035 or 036), stop authorities voting for the buggy methods 20-28, and remove a whole bunch of old code, including all the guardfraction code.
  4. At that time, we might want to remove all the methods lower than 20, too. See #24378.

Does that make sense? If it does, I'll add changes file etc to the branches above, and also make further backport branches.

Thanks for the help Tim :]

No worries. Consensus methods are fun :-)

comment:9 in reply to:  8 Changed 10 months ago by asn

Replying to teor:

Replying to asn:

OK let's recap here because these backward compatibility stuff confuse me!

For 033 master we need a patch that obsoletes the config options (also removes all the config option-relevant code), and also kills the guardfraction voting feature. See my branch bug24456_033 for this.

This is ok. Refusing to vote guardfraction is compatible with any consensus method.

For 029 and later, we just need a patch that obsoletes the config options. See my branch bug24456_029 for this.

This is ok. Refusing to vote guardfraction is compatible with any consensus method.

Then later in the future, when most dirauths use tor versions with obsolete config options, we can completely kill the rest of the code (vote/consensus guardfraction parsing, consensus voting code, client code, etc.).

One minor tweak:

Clients can ignore guardfraction in 033 or 034 if we want. Clients do not have to use consensus features.

And here is how we remove code that depends on consensus method 20 (the guardfraction method):

  1. Create a new consensus method that ignores guardfraction votes. This new method can go in 033 or 034.
  2. Change all the code that says ">= MIN METHOD FOR GUARDFRACTION" to say ">= MIN METHOD FOR GUARDFRACTION and < MIN METHOD TO IGNORE GUARDFRACTION" (the new method). This disables guardfraction when the new method or any later method is used. See consensus method 28 for an example of the code and spec that we need to remove a consensus feature.
  3. When we will never revert to a lower consensus method (two releases later, 035 or 036), stop authorities voting for the buggy methods 20-28, and remove a whole bunch of old code, including all the guardfraction code.
  4. At that time, we might want to remove all the methods lower than 20, too. See #24378.

Hmm, let's do all this stuff in 034 I think. Also separating client guardfraction functionality from dirauth-guardfraction functionality is not so easy, since they both use the same funcs, but the latter also uses it for consensus making. So perhaps we can roll with the branches I posted above for now, and build on top of them for 034.

Pushed branches bug24456_029, bug24456_030, bug24456_031 and bug24456_032 with the plan from comment:6 .

Last edited 10 months ago by asn (previous) (diff)

comment:10 Changed 10 months ago by teor

Also separating client guardfraction functionality from dirauth-guardfraction functionality is not so easy, since they both use the same funcs, but the latter also uses it for consensus making. So perhaps we can roll with the branches I posted above for now, and build on top of them for 034.

Sure!

I suggest we stop clients (and relays) calling those functions in 0.3.4. We can leave the functions as authority-only.

comment:11 Changed 10 months ago by nickm

Keywords: review-group-32 added

comment:12 Changed 10 months ago by mikeperry

Reviewer: mikeperry

comment:13 Changed 10 months ago by teor

The consensus method ticket is #25224

comment:14 Changed 10 months ago by mikeperry

After catching up on the related bugs (#13297, #16255), it seems like the reason this is being killed is because of the lack of unit test support to take in all inputs for a consensus (votes, bw auth files, guardfraction files) and produce a consensus, and then examine that consensus for expected results for both values and client usage. If we had that, it would be much easier to verify that the fix in #16255 is sufficient, right?

If that is true, then I would rather obsolete the torrc options for now without removing any code, and file a ticket for better consensus testing support. There have been several other path selection tickets where I wished I just had a full consensus to check behavior against, rather than cobbling together a mock or tiny chutney network...

Once we have such a testing mechanism, it will be much easier to bring this code back with torrc options like GuardFractionV2File (since in #16255 we already discussed changing the format slightly), rather than ripping it out and starting over completely from scratch. We need the guardfraction feature (or something like it) less since we have backed off from infinite guard lifetimes, but I don't think the need is now zero.

comment:15 in reply to:  14 ; Changed 10 months ago by arma

Replying to mikeperry:

We need the guardfraction feature (or something like it) less since we have backed off from infinite guard lifetimes, but I don't think the need is now zero.

I think this is right.

But when considering the guardfraction feature, we should think not just about the code on the Tor side, but also about what feels to the dir auths like yet another pile of abandonware crap that they're going to be asked to run and maintain forever.

So, if we're thinking about the feature's future, let's not underweight the importance of the external portion of the tool too.

comment:16 in reply to:  15 Changed 10 months ago by teor

Replying to arma:

Replying to mikeperry:

We need the guardfraction feature (or something like it) less since we have backed off from infinite guard lifetimes, but I don't think the need is now zero.

I think this is right.

We will never have perfectly accurate bandwidth allocations. So I think we need to decide how much inaccuracy we are willing to tolerate in the bandwidth system. (Currently, we tolerate 30-50% variance between bandwidth authorities. But we don't have any good way of working out the difference between the median measured bandwidth, and the actual bandwidth of a relay.)

Once we know our tolerance, it will help us prioritise fixing guardfraction.

It will also help us evaluate bandwidth authority configurations and implementations.

But when considering the guardfraction feature, we should think not just about the code on the Tor side, but also about what feels to the dir auths like yet another pile of abandonware crap that they're going to be asked to run and maintain forever.

So, if we're thinking about the feature's future, let's not underweight the importance of the external portion of the tool too.

+1

comment:17 Changed 10 months ago by teor

Also, regardless of whether we fixe guardfraction in the future, we need to remove all the old consensus methods affected by #16255.

Edit: future tense

Last edited 10 months ago by teor (previous) (diff)

comment:18 Changed 10 months ago by nickm

Keywords: review-group-33 added

comment:19 in reply to:  14 ; Changed 10 months ago by asn

Replying to mikeperry:

After catching up on the related bugs (#13297, #16255), it seems like the reason this is being killed is because of the lack of unit test support to take in all inputs for a consensus (votes, bw auth files, guardfraction files) and produce a consensus, and then examine that consensus for expected results for both values and client usage. If we had that, it would be much easier to verify that the fix in #16255 is sufficient, right?

If that is true, then I would rather obsolete the torrc options for now without removing any code, and file a ticket for better consensus testing support. There have been several other path selection tickets where I wished I just had a full consensus to check behavior against, rather than cobbling together a mock or tiny chutney network...

Once we have such a testing mechanism, it will be much easier to bring this code back with torrc options like GuardFractionV2File (since in #16255 we already discussed changing the format slightly), rather than ripping it out and starting over completely from scratch. We need the guardfraction feature (or something like it) less since we have backed off from infinite guard lifetimes, but I don't think the need is now zero.

We opted for removing the code instead of just obsoleting the torrc options so that we also simplify the codebase as a result, given that the feature has been unused for ages, and that it requires yet another dirauth script to support. That said, the guardfraction tor code hasn't caused bugs to the rest of the codebase and is quite well isolated, so AFAIK the biggest issue right now is that some dirauths still run the python script and it's burping or filling their disks space.

I can see some ways forward here:

a) We go ahead with the plan of comment:6 (and my branch) to kill this feature completely, in the assumption that we lack the time and power to fix and support it in the medium-term future. If in 3-4 years we decide to restart this project we would have to merge the code in again, add more consensus methods, etc.
b) We keep the code in (and maybe obsolete the torrc options) and ask all dirauths to stop running the python script until further notice. This would be done in the assumption that we would be able to revive this project in the future. That said, I think currently the fire department is well out of vehicles to even think of reviving this project, let alone start doing it.

Is there a different approach we could take? What should we do?

comment:20 in reply to:  19 Changed 10 months ago by mikeperry

Replying to asn:

I can see some ways forward here:

a) We go ahead with the plan of comment:6 (and my branch) to kill this feature completely, in the assumption that we lack the time and power to fix and support it in the medium-term future. If in 3-4 years we decide to restart this project we would have to merge the code in again, add more consensus methods, etc.
b) We keep the code in (and maybe obsolete the torrc options) and ask all dirauths to stop running the python script until further notice. This would be done in the assumption that we would be able to revive this project in the future. That said, I think currently the fire department is well out of vehicles to even think of reviving this project, let alone start doing it.

Is there a different approach we could take? What should we do?

I don't see any clear reason to actually remove the code. As I see it, the following are all true:

  1. #16255 stalled because we lacked an easy-to-use testing mechanism to test consensus changes. The fix looks otherwise correct to me. I agree that it is a PITA to test. But as soon as we can test it, it is likely to work.
  2. We are likely to build consensus-testing code at some point, regardless of this feature. It may be useful for vanguards tests. It will be useful for load balancing changes for padding (#8453)
  3. We do want guardfraction eventually.
  4. Future guardfraction implementations will very likely still be an external script+input file because the long-term statekeeping involved is much easier externally.
  5. The tor-core code for the guardfraction system is isolated and is not adding complexity while disabled.
  6. If we deprecate/disable the torrc options, dirauths will turn it off, perhaps automatically, on the same timescale as if we actually removed the code.

To me, this all points towards obsoleting the torrc options (perhaps in a way that causes them to stop the files from being read even if they are still set), but not removing the code.

comment:21 Changed 10 months ago by teor

If mike is correct, and the guardfraction code isolated and won't require maintenance effort, we can remove the code now, and "git revert" that commit later.

If a revert won't work later, then the code does actually require maintainence effort.

comment:22 Changed 9 months ago by nickm

Keywords: review-group-34 added

comment:23 Changed 9 months ago by nickm

Keywords: 033-triage-20180320 added

Marking all tickets reached by current round of 033 triage.

comment:24 Changed 9 months ago by nickm

Keywords: 033-included-20180320 added

Mark needs_review tickets as included for 033.

We can revisit this if they turn out to need much more review and/or revision than expected.

comment:25 in reply to:  21 ; Changed 9 months ago by mikeperry

Replying to teor:

If mike is correct, and the guardfraction code isolated and won't require maintenance effort, we can remove the code now, and "git revert" that commit later.

If a revert won't work later, then the code does actually require maintainence effort.

Having maintained a fork before, I can say that this is not true. Diffs that are otherwise untouched can be broken simply by modifying or relocating surrounding but unrelated lines.

In particular, this code touches the consensus weights. When we update these weights for padding in #8453, it will be much easier to avoid making changes that would break or simply conflict with this code if it is right there where we can see it. When we perform those padding updates (and other consensus changes), we're likely to want to have a testing framework, which will make verifying the fix in #16255 easier as well.

If we take the minimum change here (ie asn/bug24456_032), the OBSOLETE macro will cause these config entries to be ignored, so that as soon as it is deployed, dirauths will stop reading the guardfraction files automatically.

comment:26 Changed 9 months ago by arma

Description: modified (diff)

comment:27 in reply to:  25 ; Changed 9 months ago by arma

Replying to mikeperry:

If we take the minimum change here (ie asn/bug24456_032), the OBSOLETE macro will cause these config entries to be ignored, so that as soon as it is deployed, dirauths will stop reading the guardfraction files automatically.

I believe no dir auths are actually attaching their guardfraction file to their dir auth.

That is, some of us are still running the external guardfraction tool, because people told us to start running it, but nobody ever told us how to hook it up to our dir auths, so we never did.

comment:28 in reply to:  27 Changed 9 months ago by asn

Replying to arma:

Replying to mikeperry:

If we take the minimum change here (ie asn/bug24456_032), the OBSOLETE macro will cause these config entries to be ignored, so that as soon as it is deployed, dirauths will stop reading the guardfraction files automatically.

I believe no dir auths are actually attaching their guardfraction file to their dir auth.

That is, some of us are still running the external guardfraction tool, because people told us to start running it, but nobody ever told us how to hook it up to our dir auths, so we never did.

Yeah, that's part of this: We need to tell to all dirauths that are running the external guardiness script that they should stop running it (including you) since it's not gonna be used any time soon. What's the best way to contact all the dirauths? Can I send an email to the dirauth list or is it a closed one?

(BTW, wrt "we never did" to tor, we did that once 3 years ago and that's how we learned about #16255)

Last edited 9 months ago by asn (previous) (diff)

comment:29 Changed 8 months ago by asn

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final
Status: needs_reviewnew

Moving this to 034 in terms of little-t-tor changes. In the meanwhile, I'll ensure that no dirauths are running the guardiness script anymore (I think that's true, but I will make extra sure).

comment:30 in reply to:  29 Changed 8 months ago by arma

Replying to asn:

I'll ensure that no dirauths are running the guardiness script anymore (I think that's true, but I will make extra sure).

moria1 was running it until I read this comment. I just turned it off.

Should I do anything with my 2.3 gigabyte guardfraction.db file? Should I delete it? :)

comment:31 Changed 7 months ago by asn

Milestone: Tor: 0.3.4.x-finalTor: unspecified

Now that dirauths are not running the guardiness script anymore, we can defer this ticket for the future, until we are ready to potentially revive the guardiness feature.

Note: See TracTickets for help on using tickets.