Opened 3 months ago
Last modified 12 days ago
#24815 assigned defect
Validate shared random state dates before each voting period
Reported by: | teor | Owned by: | asn |
---|---|---|---|
Priority: | Medium | Milestone: | Tor: unspecified |
Component: | Core Tor/Tor | Version: | Tor: 0.2.9.1-alpha |
Severity: | Normal | Keywords: | tor-sr, 031-backport, 032-backport, tor-dos, 034-triage-20180328, 034-removed-20180328 |
Cc: | teor, asn | Actual Points: | |
Parent ID: | Points: | 1 | |
Reviewer: | Sponsor: |
Description
When tor's clock jumps, the shared random state can get out of sync with the current round.
This happens to me on a test authority on a laptop that sleeps regularly. But it could also happen to authorities that miss a voting round because they are under heavy load. (I have seen clock jumps happen on Linux and BSD under the recent heavy load.)
I see this message when I restart my authority:
Jan 07 09:48:13.179 [info] or_state_load: Loaded state from "/Users/USER/tor/auth/auth-sr-3e/state" Jan 07 09:48:14.977 [info] disk_state_validate: SR: Disk state valid after/until times are invalid. Jan 07 09:48:14.984 [info] sr_state_update: SR: State prepared for upcoming voting period (2018-01-06 23:00:00). Upcoming phase is reveal (counters: 0 commit & 1 reveal rounds). Jan 07 09:48:16.026 [info] or_state_save: Saved state to "/Users/USER/tor/auth/auth-sr-3e/state"
Even though this message was logged just before I killed it:
Jan 07 09:47:45.328 [info] or_state_save: Saved state to "/Users/USER/tor/auth/auth-sr-3e/state"
Child Tickets
Ticket | Status | Owner | Summary | Component |
---|---|---|---|---|
#24849 | merge_ready | nickm | Added -1 signatures to consensus | Core Tor/Tor |
Change History (16)
comment:1 follow-up: 2 Changed 3 months ago by
Status: | new → needs_information |
---|
comment:2 follow-up: 3 Changed 3 months ago by
Replying to dgoulet:
I suppose you are talking about the log on Jan 7th but the upcoming round on Jan 6th:
Jan 07 09:48:14.984 [info] sr_state_update: SR: State prepared for upcoming voting period (2018-01-06 23:00:00). Upcoming phase is reveal (counters: 0 commit & 1 reveal rounds).That time (
2018-01-06 23:00:00
) is thevalid_after
from the consensus and the SR subsystem only uses the time from the consensus to take every timing decision. It updates the state when loading from disk (at bootup) or when a new consensus has just been computed.
In this case, it is when booting up (
sr_init()
) I presume?
No, waking from sleep.
So it takes the consensus from disk (very old), and tries to vote with that information. Ultimately it will fail but then once your dirauth gets the latest consensus, it should sync up again with the whole dance at the next voting round.
Do you see that or I'm misunderstanding the issue or ?
There are times when it never syncs up.
Sometimes it will be out of sync for hours, and other times it won't get back in sync until I restart the Tor process.
I need to do some debugging to find out why it produces a different consensus to everyone else.
I assumed it was shared random because of these warning messages.
comment:3 Changed 3 months ago by
Replying to teor:
There are times when it never syncs up.
Sometimes it will be out of sync for hours, and other times it won't get back in sync until I restart the Tor process.
If you can attach the info logs when it is out of sync, I can look at it, in theory if the consensus you download is the latest from the network, this shouldn't happened :S.
comment:4 Changed 3 months ago by
Owner: | set to dgoulet |
---|---|
Status: | needs_information → assigned |
comment:5 Changed 3 months ago by
I can't do that, because the network addresses are supposed to be secret.
I'll email them instead.
comment:6 Changed 2 months ago by
Cc: | teor added |
---|---|
Status: | assigned → needs_information |
From the log you sent me, I noticed that the voting period are synchronized. Is it possible we are in a timezone issue situation here because if I take your local timezone and convert it to UTC, the logs do add up:
Jan 07 09:48:14.984 [info] sr_state_update: SR: State prepared for upcoming voting period (2018-01-06 23:00:00). Upcoming phase is reveal (counters: 0 commit & 1 reveal rounds).
09:48 Australia Time -> 22:48 UTC so the upcoming voting period is indeed 23:00 UTC.
comment:7 Changed 2 months ago by
It looks like we have a timezone issue when we write or read dates from the state file.
If you say the dates are valid, but the log says that they aren't, then we need to check that code.
Do you want to check it?
I won't get a chance until next week at the earliest.
comment:8 follow-up: 10 Changed 2 months ago by
Keywords: | 031-backport 032-backport added |
---|---|
Status: | needs_information → new |
I had a quick look.
This warning:
[info] disk_state_validate: SR: Disk state valid after/until times are invalid.
happens when:
/* Make sure we don't have a valid after time that is earlier than a valid * until time which would make things not work well. */ if (state->ValidAfter >= state->ValidUntil) { log_info(LD_DIR, "SR: Disk state valid after/until times are invalid."); goto invalid; }
The bug is in get_state_valid_until_time():
voting_interval = get_voting_interval(); /* Find the time the current round started. */ beginning_of_current_round = get_start_time_of_current_round(); /* Find how many rounds are left till the end of the protocol run */ current_round = (now / voting_interval) % total_rounds; rounds_left = total_rounds - current_round; /* To find the valid-until time now, take the start time of the current * round and add to it the time it takes for the leftover rounds to * complete. */ valid_until = beginning_of_current_round + (rounds_left * voting_interval);
If rounds_left is zero, then valid_until is beginning_of_current_round.
But ValidAfter is set to now in disk_state_new(), which can be after beginning_of_current_round.
This is probably a backport candidate, because if all the authorities ever have to restart near the end of a SR cycle, we could lose the SR for that cycle. Fortunately, it only affects voting, so we don't have to worry about breaking consensus.
comment:9 Changed 2 months ago by
Cc: | asn added |
---|
comment:10 Changed 2 months ago by
Replying to teor:
The bug is in get_state_valid_until_time():
voting_interval = get_voting_interval(); /* Find the time the current round started. */ beginning_of_current_round = get_start_time_of_current_round(); /* Find how many rounds are left till the end of the protocol run */ current_round = (now / voting_interval) % total_rounds; rounds_left = total_rounds - current_round; /* To find the valid-until time now, take the start time of the current * round and add to it the time it takes for the leftover rounds to * complete. */ valid_until = beginning_of_current_round + (rounds_left * voting_interval);If rounds_left is zero, then valid_until is beginning_of_current_round.
Hmm that was a pretty good hypothesis. However, can rounds_left
actually be 0? Due to the modulo in the current_round
calculation, it's guaranteed that current_round
will be between 0 and total_rounds-1
, so rounds_left
can never be 0.
Pretty close, but I think the current bug argument won't work exactly as is.
comment:11 Changed 2 months ago by
Owner: | changed from dgoulet to asn |
---|---|
Status: | new → assigned |
comment:12 Changed 2 months ago by
Keywords: | tor-dos added; tor-ddos removed |
---|
Consolidate our DoS keywords to only use "tor-dos"
comment:13 Changed 8 weeks ago by
Milestone: | Tor: 0.3.3.x-final → Tor: 0.3.4.x-final |
---|
comment:14 Changed 3 weeks ago by
Keywords: | 034-triage-20180328 added |
---|
comment:15 Changed 3 weeks ago by
Keywords: | 034-removed-20180328 added |
---|
Per our triage process, these tickets are pending removal from 0.3.4.
comment:16 Changed 12 days ago by
Milestone: | Tor: 0.3.4.x-final → Tor: unspecified |
---|
These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.
I suppose you are talking about the log on Jan 7th but the upcoming round on Jan 6th:
That time (
2018-01-06 23:00:00
) is thevalid_after
from the consensus and the SR subsystem only uses the time from the consensus to take every timing decision. It updates the state when loading from disk (at bootup) or when a new consensus has just been computed.In this case, it is when booting up (
sr_init()
) I presume? So it takes the consensus from disk (very old), and tries to vote with that information. Ultimately it will fail but then once your dirauth gets the latest consensus, it should sync up again with the whole dance at the next voting round.Do you see that or I'm misunderstanding the issue or ?