Opened 8 years ago

Closed 3 months ago

Last modified 2 months ago

#4631 closed defect (implemented)

Idea to make consensus voting more resistant

Reported by: Sebastian Owned by: teor
Priority: Medium Milestone: Tor: 0.4.4.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: needs-dirauth-email, needs-torspec-update, tor-dirauth, robustness, voting, network-team-roadmap-2020Q1
Cc: ln5, teor, nickm Actual Points: 0.2
Parent ID: #33050 Points: 0.5
Reviewer: nickm Sponsor: Sponsor55-can

Description (last modified by teor)

This is an idea how to improve the current situation, where sometimes a directory authority is slow to get its vote out to the other dirauths, and so the dirauths don't all have the same sets of votes. To simplify, I'm illustrating with an example of three dirauths:

at :50, all dirauths make their vote and start uploading. auth1 and auth2 get their vote to all auths, but auth3 doesn't. it cannot publish a vote to auth1 at all, and it takes more than 2.5 minutes to publish its vote to auth2. at :52:30, all auths try fetching the votes they're missing from the other auths, so auth1 asks auth2 for auth3's vote, and auth2 asks auth1 for auth3's vote. auth3 asks nobody, and nobody asks auth3. At this point, neither auth1 nor auth2 have auth3's vote. auth3 now (at, for example, :53:30) succeeds publishing to auth2, so auth1 now has a vote from auth1 and auth2, auth2 and auth3 have a vote from auth1, auth2, and auth3. At :55 the auths try to make a consensus, but auth1 will end up with a different consensus than auth2 and auth3.

My idea to make this less of a problem would be that only for two minutes will we accept a vote that gets pushed to us, and anything we get later than that is considered "too late" and will be dropped. At :52:30 minutes, we still go ahead and try and fetch all votes from all the other authorities, and if they have a vote we will accept it. We repeat that fetching of all votes that we don't have at 53:00, 53:30, 54:00 and 54:30. That way, a delayed publication of the original vote will not cause this kind of split, where the dirauths have different opinions on who has voted, so only the dirauth that took more than 2 minutes to publish its descriptor to any of the other dirauths will be affected. There's still a race condition here, which is when a dirauth (within two minutes) only publishes to one other dirauth, and then that dirauth gets so slow it cannot get the vote to any of the other votes. But since it was fast enough to get the vote the first time, hopefully that's rather rare.

Does this all sound viable? Am I overlooking something?

Update: This bug was introduced in Tor 0.2.0.5-alpha, with the v3 authority voting code.

Child Tickets

TicketStatusOwnerSummaryComponent
#27146closedMismatched digest in 0.3.3.9 and master mixed chutney networkCore Tor/Tor

Change History (30)

comment:1 Changed 8 years ago by ln5

Cc: ln5 added

comment:2 Changed 8 years ago by nickm

Keywords: needs-proposal added

comment:3 Changed 8 years ago by nickm

This could be related to #4826

comment:4 Changed 8 years ago by nickm

Keywords: tor-auth added

comment:5 Changed 8 years ago by nickm

Component: Tor Directory AuthorityTor

comment:6 Changed 7 years ago by nickm

Milestone: Tor: 0.2.4.x-finalTor: unspecified
Type: defectenhancement

comment:7 Changed 3 years ago by dgoulet

Keywords: tor-dirauth added; tor-auth removed

Turns out that tor-auth is for directory authority so make it clearer with tor-dirauth

comment:8 Changed 3 years ago by nickm

Keywords: robustness voting added
Severity: Normal

comment:9 Changed 4 months ago by arma

I have been running my ticket4631-demo branch on moria1 lately, and it is catching and handling real buggy behavior by other directory authorities:

Jan 28 15:59:50.804 [warn] Rejecting vote from 199.58.81.140 received at 2020-01-28 20:59:50; our cutoff for received votes is 2020-01-28 20:52:30
Jan 28 15:59:50.805 [warn] Rejected vote from 199.58.81.140 ("Vote received too late, would be dangerous to count it").
Jan 29 01:52:52.667 [warn] Rejecting vote from 204.13.164.118 received at 2020-01-29 06:52:52; our cutoff for received votes is 2020-01-29 06:52:30
Jan 29 01:52:52.669 [warn] Rejected vote from 204.13.164.118 ("Vote received too late, would be dangerous to count it").
Jan 29 04:53:26.323 [warn] Rejecting vote from 204.13.164.118 received at 2020-01-29 09:53:26; our cutoff for received votes is 2020-01-29 09:52:30
Jan 29 04:53:26.326 [warn] Rejected vote from 204.13.164.118 ("Vote received too late, would be dangerous to count it").

It would be swell for somebody to turn it into a branch that the network team is willing to merge. I don't think it would fully resolve this ticket but it is a nice step.

And in the mean time, other dir auths, if they're ok with manual patching, might enjoy running it.

comment:10 Changed 4 months ago by dgoulet

Milestone: Tor: unspecifiedTor: 0.4.4.x-final
Owner: set to dgoulet
Status: newaccepted

comment:11 Changed 4 months ago by teor

Cc: teor added

comment:12 Changed 4 months ago by dgoulet

Cc: teor removed

Made an attempt at a spec modification. I wonder if the explanation makes sense or could benefit more meat around the bone or not?

Branch: ticket4631_01

comment:13 Changed 4 months ago by dgoulet

Cc: teor added

comment:14 Changed 4 months ago by teor

Status: acceptedneeds_revision

Looks good to me, but this wording might be easier to understand:
"But they stop accepting votes posted to them."

Some extra explanation might be useful, to cover the cases where:

  • another authority A uploads its vote to some authorities after 52:30, and
  • this authority fetches that vote from:
    • the same authority A, or
    • a third authority B.

I would be happy with a spec patch, rather than a proposal, but let's send it to tor-dev with a quick explanation?

comment:15 Changed 3 months ago by teor

Also, arma's ticket4631-demo branch needs revision: it modifies the code, but the unit tests don't compile.

comment:16 Changed 3 months ago by teor

Cc: nickm added
Keywords: needs-dirauth-email needs-torspec-update added; needs-proposal removed
Owner: changed from dgoulet to teor
Points: 0.5
Sponsor: Sponsor55-can
Status: needs_revisionassigned

It looks like this patch fixes some chutney failures. It replaces "mismatched digest" with "vote received too late".

I'm going to see if I can get the unit tests compiling, and a spec patch.

Nick, should we activate this new code:

  • on all directory authorities, or
  • just on directory authorities in test networks?

arma has been running it on moria1 for some time now.

comment:17 Changed 3 months ago by teor

Parent ID: #33050

comment:18 Changed 3 months ago by teor

Actual Points: 0.2
Description: modified (diff)
Status: assignedneeds_review
Type: enhancementdefect

See my PR:

I don't think we should backport this change, it's medium-risk, and it's also very hard to test. We would need to get authorities under severe load, or with bad clock skew.

That said, it resolves at least one common issue in chutney, so it is a definite improvement.

I'll open a child ticket for the spec patch.

comment:19 Changed 3 months ago by teor

The spec patch is in #33358.

comment:20 Changed 3 months ago by dgoulet

Reviewer: nickm

comment:21 Changed 3 months ago by nickm

Reviewing the idea only, before looking at the patch:

So the idea is that we cut off the deadline for POSTed votes early, to increase the odds that fetching votes from else will give everybody the same set of votes? That's plausible, but I'm wondering whether we're replacing one race condition with another. I'm also interested in:

  • what happens during the transition period where some authorities have upgraded but others have not.
  • whether the fetch time is the right point for the POST cutoff, or whether we'd be better off having it be a second or two before.

We should also think about our diagnostics for unsynchronized consensuses. Are they good enough that we'll be able to tell whether this is helping or hurting?

Now I'll look through the patch.

comment:22 Changed 3 months ago by nickm

Status: needs_reviewneeds_revision

I've added a small comment on the patch. This looks reasonably solid.

Teor, how confident are you that this is actually helping out for the chutney case? Since you've been running into sync issues on chutney, I guess you've probably been testing this patch in that case?

comment:23 in reply to:  22 Changed 3 months ago by teor

Status: needs_revisionneeds_review

Replying to nickm:

Reviewing the idea only, before looking at the patch:

So the idea is that we cut off the deadline for POSTed votes early, to increase the odds that fetching votes from else will give everybody the same set of votes? That's plausible, but I'm wondering whether we're replacing one race condition with another.

I think we *are* replacing one race condition with another, but in practice, the replacement is less risky. Here's why:

  • in the first phase, each authority uploads *its own vote* to each other authority (1->N)
  • in the second phase, each authority downloads *all votes* from each other authority (N<-N)

So the second phase ensures that all well-connected authorities have the same set of votes. But it requires "no partial, late uploads", because each partial, late upload splits the set of authorities.

This patch enforces the "no late uploads" rule, which implies "no partial, late uploads".

I'm also interested in:

  • what happens during the transition period where some authorities have upgraded but others have not.

Some authorities reject late uploads, others do not.

If the late upload is a full, late upload, then the set of authorities splits between:

  • authorities that reject the late upload, and
  • authorities that accept the late upload.

The mitigation is to ensure that all authorities' clocks are synchronised within ~5 minutes.

Right now, all authority clocks are synchronised to within ~1 second:
https://consensus-health.torproject.org/#authorityclocks

If the late upload is a partial, late upload, then the set of authorities is already split. The authorities that reject the late upload are protected from the split. (Strictly, they reject the late upload, and end up in the set of authorities that didn't get the vote).

In my chutney tests, this patch replaces "mistmatched digest" (consensus failure) with "Vote received too late" (late vote rejection, consensus success).

I've also done some testing in mixed 0.3.5 and master networks (equal numbers of authorities). I haven't seen a consensus failure in a mixed network after applying this patch, they were common before.

  • whether the fetch time is the right point for the POST cutoff, or whether we'd be better off having it be a second or two before.

The cutoff time is the time that the entire vote has been received. So the current code works as-is. But ideally, we would want multiple authorities to receive each vote before the cutoff.

For chutney (and other fast networks), the current cutoff should not be changed: the intervals are only a few seconds long.

For the public network (and other slow networks), votes are about 3.5 MB, and let's say authorities have about 10 Mbps of bandwidth spare for votes. So that's:
3 seconds per vote upload * 8 authorities = 24 seconds

So let's set the cutoff to: fetch_missing_votes - vote_delay/6 ?

Note that fetch_missing_votes is already start - dist_delay - (vote_delay/2), so we can't go any earlier than fetch_missing_votes - vote_delay/4.

Also, fast networks need about 4 seconds to vote reliably (upload, process, download, process), so the extra cutoff should be at most vote_delay/5.

Interestingly, the "successful vote upload" bandwidth cutoff for public authorities is about:
3.5 MB * 8 vote uploads / 300 seconds * 8 bits = 0.75 Mbit/s

We should also think about our diagnostics for unsynchronized consensuses. Are they good enough that we'll be able to tell whether this is helping or hurting?

Yes, I've seen the diagnostics change in chutney from "mistmatched digest" to "Vote received too late".

If rejecting late votes splits the consensus, we'll see both "Vote received too late" and "mistmatched digest" for the same consensus period.

Replying to nickm:

Teor, how confident are you that this is actually helping out for the chutney case? Since you've been running into sync issues on chutney, I guess you've probably been testing this patch in that case?

It is surprisingly effective. I haven't seen a consensus failure in ~100 chutney runs, since applying this patch.

I've added a small comment on the patch. This looks reasonably solid.

I've fixed some of the logging in the patch.

Our remaining work is to choose the exact threshold.

comment:24 Changed 3 months ago by teor

  • what happens during the transition period where some authorities have upgraded but others have not.

I have just seen one chutney network where the old authorities fail to make a consensus, but the new authorities ignore the late votes.

It's a half old, half new network, so it would have failed without this patch, and it still failed with it.

But if a majority of authorities had been running this patch, the consensus might have succeeded.

comment:25 Changed 3 months ago by nickm

Status: needs_reviewmerge_ready

okay, this is looking reasonable to me. Let's squash and merge it after the meeting.

comment:26 Changed 3 months ago by teor

Do you want me to change the cutoff to fetch_missing_votes - vote_delay/6 ?

comment:27 Changed 3 months ago by nickm

Let's try it as it stands.

comment:28 Changed 3 months ago by nickm

Squashed & merged!

comment:29 Changed 3 months ago by nickm

Resolution: implemented
Status: merge_readyclosed

comment:30 Changed 2 months ago by gaba

Keywords: network-team-roadmap-2020Q1 added

Add all the tickets from sponsor 55 that are done and being worked on to the keyword #network-team-roadmap-2020Q1 so I can look at them in the wiki page...

Note: See TracTickets for help on using tickets.