Opened 3 months ago

Last modified 12 days ago

#24815 assigned defect

Validate shared random state dates before each voting period

Reported by: teor Owned by: asn
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.2.9.1-alpha
Severity: Normal Keywords: tor-sr, 031-backport, 032-backport, tor-dos, 034-triage-20180328, 034-removed-20180328
Cc: teor, asn Actual Points:
Parent ID: Points: 1
Reviewer: Sponsor:

Description

When tor's clock jumps, the shared random state can get out of sync with the current round.

This happens to me on a test authority on a laptop that sleeps regularly. But it could also happen to authorities that miss a voting round because they are under heavy load. (I have seen clock jumps happen on Linux and BSD under the recent heavy load.)

I see this message when I restart my authority:

Jan 07 09:48:13.179 [info] or_state_load: Loaded state from "/Users/USER/tor/auth/auth-sr-3e/state"
Jan 07 09:48:14.977 [info] disk_state_validate: SR: Disk state valid after/until times are invalid.
Jan 07 09:48:14.984 [info] sr_state_update: SR: State prepared for upcoming voting period (2018-01-06 23:00:00). Upcoming phase is reveal (counters: 0 commit & 1 reveal rounds).
Jan 07 09:48:16.026 [info] or_state_save: Saved state to "/Users/USER/tor/auth/auth-sr-3e/state"

Even though this message was logged just before I killed it:

Jan 07 09:47:45.328 [info] or_state_save: Saved state to "/Users/USER/tor/auth/auth-sr-3e/state"

Child Tickets

TicketStatusOwnerSummaryComponent
#24849merge_readynickmAdded -1 signatures to consensusCore Tor/Tor

Change History (16)

comment:1 Changed 3 months ago by dgoulet

Status: newneeds_information

I suppose you are talking about the log on Jan 7th but the upcoming round on Jan 6th:

Jan 07 09:48:14.984 [info] sr_state_update: SR: State prepared for upcoming voting period (2018-01-06 23:00:00). Upcoming phase is reveal (counters: 0 commit & 1 reveal rounds).

That time (2018-01-06 23:00:00) is the valid_after from the consensus and the SR subsystem only uses the time from the consensus to take every timing decision. It updates the state when loading from disk (at bootup) or when a new consensus has just been computed.

In this case, it is when booting up (sr_init()) I presume? So it takes the consensus from disk (very old), and tries to vote with that information. Ultimately it will fail but then once your dirauth gets the latest consensus, it should sync up again with the whole dance at the next voting round.

Do you see that or I'm misunderstanding the issue or ?

comment:2 in reply to:  1 ; Changed 3 months ago by teor

Replying to dgoulet:

I suppose you are talking about the log on Jan 7th but the upcoming round on Jan 6th:

Jan 07 09:48:14.984 [info] sr_state_update: SR: State prepared for upcoming voting period (2018-01-06 23:00:00). Upcoming phase is reveal (counters: 0 commit & 1 reveal rounds).

That time (2018-01-06 23:00:00) is the valid_after from the consensus and the SR subsystem only uses the time from the consensus to take every timing decision. It updates the state when loading from disk (at bootup) or when a new consensus has just been computed.

In this case, it is when booting up (sr_init()) I presume?

No, waking from sleep.

So it takes the consensus from disk (very old), and tries to vote with that information. Ultimately it will fail but then once your dirauth gets the latest consensus, it should sync up again with the whole dance at the next voting round.

Do you see that or I'm misunderstanding the issue or ?

There are times when it never syncs up.
Sometimes it will be out of sync for hours, and other times it won't get back in sync until I restart the Tor process.

I need to do some debugging to find out why it produces a different consensus to everyone else.
I assumed it was shared random because of these warning messages.

comment:3 in reply to:  2 Changed 3 months ago by dgoulet

Replying to teor:

There are times when it never syncs up.
Sometimes it will be out of sync for hours, and other times it won't get back in sync until I restart the Tor process.

If you can attach the info logs when it is out of sync, I can look at it, in theory if the consensus you download is the latest from the network, this shouldn't happened :S.

comment:4 Changed 3 months ago by dgoulet

Owner: set to dgoulet
Status: needs_informationassigned

comment:5 Changed 3 months ago by teor

I can't do that, because the network addresses are supposed to be secret.
I'll email them instead.

comment:6 Changed 2 months ago by dgoulet

Cc: teor added
Status: assignedneeds_information

From the log you sent me, I noticed that the voting period are synchronized. Is it possible we are in a timezone issue situation here because if I take your local timezone and convert it to UTC, the logs do add up:

Jan 07 09:48:14.984 [info] sr_state_update: SR: State prepared for upcoming voting period (2018-01-06 23:00:00). Upcoming phase is reveal (counters: 0 commit & 1 reveal rounds).

09:48 Australia Time -> 22:48 UTC so the upcoming voting period is indeed 23:00 UTC.

comment:7 Changed 2 months ago by teor

It looks like we have a timezone issue when we write or read dates from the state file.
If you say the dates are valid, but the log says that they aren't, then we need to check that code.

Do you want to check it?
I won't get a chance until next week at the earliest.

comment:8 Changed 2 months ago by teor

Keywords: 031-backport 032-backport added
Status: needs_informationnew

I had a quick look.
This warning:

[info] disk_state_validate: SR: Disk state valid after/until times are invalid.

happens when:

  /* Make sure we don't have a valid after time that is earlier than a valid
   * until time which would make things not work well. */
  if (state->ValidAfter >= state->ValidUntil) {
    log_info(LD_DIR, "SR: Disk state valid after/until times are invalid.");
    goto invalid;
  }

The bug is in get_state_valid_until_time():

  voting_interval = get_voting_interval();
  /* Find the time the current round started. */
  beginning_of_current_round = get_start_time_of_current_round();

  /* Find how many rounds are left till the end of the protocol run */
  current_round = (now / voting_interval) % total_rounds;
  rounds_left = total_rounds - current_round;

  /* To find the valid-until time now, take the start time of the current
   * round and add to it the time it takes for the leftover rounds to
   * complete. */
  valid_until = beginning_of_current_round + (rounds_left * voting_interval);

If rounds_left is zero, then valid_until is beginning_of_current_round.
But ValidAfter is set to now in disk_state_new(), which can be after beginning_of_current_round.

This is probably a backport candidate, because if all the authorities ever have to restart near the end of a SR cycle, we could lose the SR for that cycle. Fortunately, it only affects voting, so we don't have to worry about breaking consensus.

comment:9 Changed 2 months ago by dgoulet

Cc: asn added

comment:10 in reply to:  8 Changed 2 months ago by asn

Replying to teor:

The bug is in get_state_valid_until_time():

  voting_interval = get_voting_interval();
  /* Find the time the current round started. */
  beginning_of_current_round = get_start_time_of_current_round();

  /* Find how many rounds are left till the end of the protocol run */
  current_round = (now / voting_interval) % total_rounds;
  rounds_left = total_rounds - current_round;

  /* To find the valid-until time now, take the start time of the current
   * round and add to it the time it takes for the leftover rounds to
   * complete. */
  valid_until = beginning_of_current_round + (rounds_left * voting_interval);

If rounds_left is zero, then valid_until is beginning_of_current_round.

Hmm that was a pretty good hypothesis. However, can rounds_left actually be 0? Due to the modulo in the current_round calculation, it's guaranteed that current_round will be between 0 and total_rounds-1, so rounds_left can never be 0.

Pretty close, but I think the current bug argument won't work exactly as is.

comment:11 Changed 2 months ago by asn

Owner: changed from dgoulet to asn
Status: newassigned

comment:12 Changed 2 months ago by dgoulet

Keywords: tor-dos added; tor-ddos removed

Consolidate our DoS keywords to only use "tor-dos"

comment:13 Changed 8 weeks ago by dgoulet

Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

comment:14 Changed 3 weeks ago by nickm

Keywords: 034-triage-20180328 added

comment:15 Changed 3 weeks ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:16 Changed 12 days ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

Note: See TracTickets for help on using tickets.