#27302 closed defect (wontfix)

Duplicate votes on 0.3.4 and later

Reported by: teor
Owned by: teor
Priority: Medium
Milestone: Tor: 0.3.4.x-final
Component: Core Tor/Tor
Version: Tor: 0.3.4.1-alpha
Severity: Normal
Keywords: regression
Cc:
Actual Points:
Parent ID: #27303
Points:
Reviewer:
Sponsor:

Description

Tor 0.3.4 changed periodic event timings.

Occasionally, this means authorities send a duplicate vote:

Detail: chutney/tools/warnings.sh /Users/base/chutney/net/nodes.1535124133
Warning: Rejected vote from 127.0.0.1 ("Duplicate discarded"). Number: 1

Maybe we should avoid sending the same vote twice?
(Or maybe not, if the remote authority restarts.)

Change History (8)

comment:1 Changed 12 months ago by arma

Does it happen for two authorities that have been running the whole time? If so, that's weird (and potentially a bug), since it ought to only trigger the event when the right time arrives.

If it happens when a dir auth has restarted at just the right time, then yes this is expected and normal.

comment:2 in reply to:  1 Changed 12 months ago by teor

Keywords: regression 034-must added
Milestone: changed from Tor: unspecified to Tor: 0.3.4.x-final
Priority: changed from Low to Medium
Severity: changed from Minor to Normal

Replying to arma:

Does it happen for two authorities that have been running the whole time? If so, that's weird (and potentially a bug), since it ought to only trigger the event when the right time arrives.

If it happens when a dir auth has restarted at just the right time, then yes this is expected and normal.

In chutney, authorities run the entire time. They are not restarted (or HUP'ed).

Therefore, this is a regression we must diagnose in 0.3.4.

comment:3 Changed 12 months ago by teor

I looked at the logs, and there aren't actually any duplicate POSTs by the sending authorities. So I don't know why 0.3.4 authorities log this message.

comment:4 Changed 12 months ago by teor

Parent ID: #27146

comment:5 in reply to: 3 Changed 12 months ago by arma

Replying to teor:

I looked at the logs, and there aren't actually any duplicate POSTs by the sending authorities. So I don't know why 0.3.4 authorities log this message.

Hint: Part of the process of making sure we've got all the votes is that we fetch missing votes from other dir auths. Do we just ask the others for all their votes? I don't remember the details.

comment:6 in reply to: 5 Changed 12 months ago by teor

Owner: set to teor
Status: changed from new to assigned

Replying to arma:

Replying to teor:

I looked at the logs, and there aren't actually any duplicate POSTs by the sending authorities. So I don't know why 0.3.4 authorities log this message.

Hint: Part of the process of making sure we've got all the votes is that we fetch missing votes from other dir auths. Do we just ask the others for all their votes? I don't remember the details.

We ask all the others for the votes that *we* don't have yet:
https://github.com/teor2345/tor/blob/master/src/feature/dirauth/dirvote.c#L2961

Which can result in duplicate votes for two reasons:

  • multiple other authorities have the vote(s) we're looking for
    • we can safely ignore this case in chutney (#27303)
  • we ask an authority for its vote, while it is uploading its vote to us
    • we can safely ignore this warning in chutney (#27303), but I also have a patch that will mitigate it in chutney, and on the public network
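The two cases above can be sketched with a toy model of one authority's vote store. This is only an illustration: `VoteStore`, its methods, the authority names, and the digests are all hypothetical and are not taken from Tor's dirvote.c. The model assumes a vote counts as a duplicate when we already hold the same vote from the same authority.

```python
from dataclasses import dataclass, field


@dataclass
class VoteStore:
    """Toy model of the votes one authority holds for the current period."""

    votes: dict = field(default_factory=dict)      # authority name -> vote digest
    warnings: list = field(default_factory=list)   # log lines this store emitted

    def missing(self, authorities):
        """Votes we would ask the other authorities for."""
        return [name for name in authorities if name not in self.votes]

    def add_vote(self, author, digest, via):
        """Store a vote, or warn if we already hold this exact vote."""
        if self.votes.get(author) == digest:
            self.warnings.append(
                f'Rejected vote by {author} received via {via} ("Duplicate discarded")'
            )
            return False
        self.votes[author] = digest
        return True


authorities = ["auth1", "auth2", "auth3", "auth4"]
store = VoteStore()
store.add_vote("auth1", "digest-1", via="self")

# We are auth1. At fetch time we hold only our own vote, so we ask for the rest.
requested = store.missing(authorities)

# Second case: auth2 was already uploading its vote when we asked it for that vote.
store.add_vote("auth2", "digest-2", via="upload from auth2")
store.add_vote("auth2", "digest-2", via="fetch from auth2")   # duplicate

# First case: auth2 and auth4 both hold auth3's vote and both send it to us.
store.add_vote("auth3", "digest-3", via="fetch from auth2")
store.add_vote("auth3", "digest-3", via="fetch from auth4")   # duplicate

for line in store.warnings:
    print(line)
```

In this model neither duplicate changes the stored votes; the only visible effect is the extra warning line.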

That usually doesn't happen in the public network, because there are usually 2.5 minutes between creating and uploading votes, and checking for missing votes.

But chutney has 1 second between creating and uploading votes. It also starts all the authorities at about the same time.
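A rough way to see the difference is to compare the gap before the missing-vote check with how long an upload takes to arrive. In this sketch the two gaps (150 seconds and 1 second) come from this comment, while the function name and the upload durations are made up for illustration.

```python
def fetch_races_upload(gap_seconds, upload_seconds):
    """Return True if the missing-vote check runs before the upload has landed.

    gap_seconds: time between creating/uploading votes and checking for
                 missing votes.
    upload_seconds: hypothetical time for one upload to reach the receiver.
    """
    return upload_seconds >= gap_seconds


PUBLIC_GAP = 150.0   # "usually 2.5 minutes" on the public network
CHUTNEY_GAP = 1.0    # "1 second" in chutney

# Hypothetical upload durations, in seconds.
for upload in (0.2, 1.5, 5.0):
    print(
        f"upload={upload}s",
        f"public race={fetch_races_upload(PUBLIC_GAP, upload)}",
        f"chutney race={fetch_races_upload(CHUTNEY_GAP, upload)}",
    )
```

With these assumed durations, any upload that takes a second or more races the fetch in chutney, while none of them comes close to the public network's gap.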

Here's why that matters:

In 0.3.3 and earlier, we called dirvote_act() from second_elapsed_callback(). So there was almost always enough time for each authority to make its vote, start the upload, and the other authorities to get the upload, before the next second_elapsed_callback().

In 0.3.4 and later, we call dirvote_act() when we start up to check the schedule. Then we call it at the end of each event loop, when a scheduled action is due. Sometimes there isn't enough time for an authority to upload, before others start downloading.

In the case where the authority starts late, all actions are scheduled to happen as soon as possible.

In 0.3.3 and earlier, all actions happen in the same callback. In most cases, we would calculate our vote, start uploading, start downloading, fail to create the consensus, and fail to publish the consensus - all in a few microseconds in the same callback.

In 0.3.4 and later, each scheduled action happens in successive callbacks, which are a few hundred milliseconds apart. So we can calculate our vote, start uploading, start downloading, try to create the consensus, and try to publish the consensus - and the results are very racy, because the other authorities are doing the same thing.

This issue can also happen in rare cases on the public network, if an authority starts up at just the right time. In particular, if an authority starts up just before HH:55, it will split the consensus on the other authorities, because some get its vote and some don't.

Usually, that's ok, because a majority will end up with the new authority's vote (or not end up with its vote). But if multiple authorities start up near the same time (like chutney), or there's some other split at the same time (like consensus methods), then the consensus can fail.

I have a patch that makes authorities skip creating their vote if the other authorities have already created the consensus for this period. (And similarly, they skip creating the consensus if the other authorities have already published it.) I'm still testing it. (And putting out other fires.)
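The decision described here could look roughly like the sketch below. It is not the patch itself: `plan_late_start` and its two boolean inputs are hypothetical names, and how an authority would actually learn what the others have done is left out.

```python
def plan_late_start(peers_created_consensus, peers_published_consensus):
    """Return the voting steps a late-starting authority would still run.

    Both arguments are hypothetical flags describing what the other
    authorities have already done for the current voting period.
    """
    steps = []
    if not peers_created_consensus:
        # Nobody has a consensus yet, so our vote can still be counted.
        steps.append("create and upload vote")
    if not peers_published_consensus:
        # The consensus is not published yet, so we still try to create it.
        steps.append("create consensus")
    return steps


print(plan_late_start(False, False))
print(plan_late_start(True, False))
print(plan_late_start(True, True))
```

Skipping the vote in the second and third calls is what would keep a late starter from splitting the consensus on the other authorities.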

comment:7 in reply to:  6 Changed 12 months ago by teor

Replying to teor:

...

In 0.3.4 and later, each scheduled action happens in successive callbacks, which are a few hundred milliseconds apart.

This sentence is wrong: in 0.3.4, all scheduled actions happen in the same callback, just like 0.3.3. But the issues I describe can still happen on the public network.

comment:8 Changed 12 months ago by teor

Keywords: 034-must removed
Parent ID: changed from #27146 to #27303
Resolution: wontfix
Status: changed from assigned to closed

I made some changes to chutney, and I think we should fix this issue by ignoring the warning in chutney. It's unlikely to happen in the public network. And if it does, it's not a big deal.
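As an illustration of what ignoring the warning means, here is a small log filter. It is a sketch only and does not reproduce chutney's warnings.sh: the `IGNORED` list and `unexpected_warnings` are hypothetical, the first sample line is the warning from the description, and the second sample line is invented for the example.

```python
import re

# Warnings we treat as expected. The pattern matches the message quoted in
# this ticket's description; the list itself is a hypothetical construct.
IGNORED = [
    re.compile(r'Rejected vote from \S+ \("Duplicate discarded"\)'),
]


def unexpected_warnings(lines):
    """Return warning lines that do not match any ignored pattern."""
    return [
        line
        for line in lines
        if line.startswith("Warning:")
        and not any(pattern.search(line) for pattern in IGNORED)
    ]


sample = [
    'Warning: Rejected vote from 127.0.0.1 ("Duplicate discarded"). Number: 1',
    "Warning: Some unrelated example warning. Number: 2",  # invented line
]

for line in unexpected_warnings(sample):
    print(line)
```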
