Opened 6 months ago

Closed 5 months ago

#25756 closed defect (fixed)

EARLY_CONSENSUS_NOTICE_SKEW of 60 is too strict for some drifting dirauth clocks

Reported by: Dbryrtfbcbhgf Owned by: catalyst
Priority: Medium Milestone: Tor: 0.3.4.x-final
Component: Core Tor/Tor Version: Tor: 0.2.2.25-alpha
Severity: Normal Keywords: clock-skew, s8-errors, 034-roadmap-proposed
Cc: Actual Points:
Parent ID: Points:
Reviewer: isis Sponsor: Sponsor8-can

Description

I keep getting this error on my relay, I set my timezone as Bucharest, Romania. The relay is running ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-119-generic x86_64) and using Tor 0.3.2.10

[WARN] Received ns flavor consensus with skewed time (CONSENSUS): It seems that our clock is behind by 1 minutes, 1 seconds, or that theirs is ahead. Tor requires an

  accurate clock to work: please check your time, timezone, and date settings.

17:58:59 [WARN] Our clock is 1 minutes, 1 seconds behind the time published in the consensus network status document (2018-04-09 15:00:00 UTC). Tor needs an accurate clock to

  work correctly. Please check your time and date settings!

Child Tickets

Attachments (1)

all logs.zip (7.8 KB) - added by Dbryrtfbcbhgf 6 months ago.

Download all attachments as: .zip

Change History (24)

comment:1 Changed 6 months ago by Dbryrtfbcbhgf

Here is the relay

https://metrics.torproject.org/rs.html#details/5D86AFD7CE409251E67B373B4F0E780A0F41C944
I also made sure the system time was synced.
Synchronized to time server 91.189.91.157:123 (ntp.ubuntu.com).

Last edited 6 months ago by Dbryrtfbcbhgf (previous) (diff)

comment:2 Changed 6 months ago by catalyst

Keywords: clock-skew s8-errors added
Milestone: Tor: unspecified
Sponsor: Sponsor8-can
Status: newneeds_information
Version: Tor: 0.3.2.10

The clock skew tolerance should probably be larger than 61 seconds so this might be an actual bug. I think we could use a little more information though. Is the skew the same amount each time you get that warning, or does it change?

What output do you get from ntpq -n -c sysinfo -c peers?

comment:4 Changed 6 months ago by arma

Do we have a log message around there that says where we got the consensus from?

Since this is a dir mirror, we got it from an authority, right?

There is a tiny part of me that wonders if this is because dizum's clock is 65 seconds early.

But...it *can't* be just that, right? Since this relay is seeing a consensus that was made in the future, and that means this relay's clock is set far enough in the past that all the dir auths made a consensus and timestamped it and made it available yet it was *still* in the future from this relay's perspective.

comment:5 Changed 6 months ago by arma

Ok, here's our hint: both this ticket and the tor-relays thread had this happen at 59 minutes after the hour.

comment:6 Changed 6 months ago by arma

Scenario: dizum is 65 seconds early. So it votes early, and sends a signature early, and most importantly, it makes the new consensus available early.

So if the relay here has the correct time, and it happens to ask for a consensus at the 59 minute mark, then dizum has already switched over to handing out the new consensus, and most importantly, it sticks a timestamp on the new consensus that says it came from the top of the hour. And while all the other dir auths have accurate clocks, they send their signatures early (5 minutes early) for robustness. So it doesn't matter whether they have accurate clocks, dizum can by itself produce a consensus with a timestamp in the future that their signatures on it, and it can do this theoretically as soon as it has enough signatures from dir auths for that round -- i.e. 5 minutes early if it wanted to drift more.

comment:7 Changed 6 months ago by arma

Ok, here's a plan:

  • I'm going to file a consensus-health ticket and a doctor ticket, to have us check the dir auths for clock skew, to get earlier notice when they go wrong.
  • I'm going to contact dizum's operator and get him to fix it one more time.
  • I continue to think we should change the relay consensus fetching algorithm to wait a little while when it rolls the dice and they come up between :55 and :00. Dgoulet says he has a diagram, in his little book, of when each role fetches the consensus. We should get that transcribed into dir-spec, and then build a plan for this third item.

comment:8 in reply to:  7 Changed 6 months ago by arma

Replying to arma:

  • I'm going to file a consensus-health ticket and a doctor ticket

Done: #25767 and #25768.

  • I'm going to contact dizum's operator and get him to fix it one more time.

Also done (well, initiated).

  • I continue to think we should change the relay consensus fetching algorithm to wait a little while when it rolls the dice and they come up between :55 and :00. Dgoulet says he has a diagram, in his little book, of when each role fetches the consensus. We should get that transcribed into dir-spec, and then build a plan for this third item.

Pending on dgoulet to send us the pics.

comment:9 in reply to:  2 ; Changed 6 months ago by Dbryrtfbcbhgf

Replying to catalyst:

The clock skew tolerance should probably be larger than 61 seconds so this might be an actual bug. I think we could use a little more information though. Is the skew the same amount each time you get that warning, or does it change?

What output do you get from ntpq -n -c sysinfo -c peers?

associd=0 status=c016 leap_alarm, sync_unspec, 1 event, restart,
system peer: 0.0.0.0:0
system peer mode: unspec
leap indicator: 11
stratum: 16
log2 precision: -24
root delay: 0.000
root dispersion: 0.045
reference ID: INIT
reference time: 00000000.00000000 Thu, Feb 7 2036 8:28:16.000
system jitter: 0.000000
clock jitter: 0.000
clock wander: 0.000
broadcast delay: 0.000
symm. auth. delay: 0.000

remote refid st t when poll reach delay offset jitter

==============================================================================

0.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.000
1.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.000
2.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.000
3.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.000
ntp.ubuntu.com .POOL. 16 p - 64 0 0.000 0.000 0.000

comment:10 Changed 6 months ago by Dbryrtfbcbhgf

Last edited 6 months ago by Dbryrtfbcbhgf (previous) (diff)

Changed 6 months ago by Dbryrtfbcbhgf

Attachment: all logs.zip added

comment:11 in reply to:  9 ; Changed 6 months ago by catalyst

Status: needs_informationnew

Replying to Dbryrtfbcbhgf:

Replying to catalyst:

What output do you get from ntpq -n -c sysinfo -c peers?

associd=0 status=c016 leap_alarm, sync_unspec, 1 event, restart,
system peer:        0.0.0.0:0
system peer mode:   unspec

Thanks. I'm pretty sure the above means the ntpd is not synchronized; possibly it has just restarted?

leap indicator:     11
stratum:            16
log2 precision:     -24
root delay:         0.000
root dispersion:    0.045
reference ID:       INIT
reference time:     00000000.00000000  Thu, Feb  7 2036  8:28:16.000
system jitter:      0.000000
clock jitter:       0.000
clock wander:       0.000
broadcast delay:    0.000
symm. auth. delay:  0.000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000

I think the other output is also consistent with the ntpd having just restarted.

Anyway, as arma commented above, the more likely problem seems to be a dirauth having an inaccurate clock.

comment:12 in reply to:  11 Changed 6 months ago by Dbryrtfbcbhgf

Replying to catalyst:

Replying to Dbryrtfbcbhgf:

Replying to catalyst:

What output do you get from ntpq -n -c sysinfo -c peers?

Thanks. I'm pretty sure the above means the ntpd is not synchronized; possibly it has just restarted?

I think the other output is also consistent with the ntpd having just restarted.

Anyway, as arma commented above, the more likely problem seems to be a dirauth having an inaccurate clock.

Is there anything I should do on my end? manually sync ntpd?
Like

sudo service ntp stop
sudo ntpd -gq
sudo service ntp start

source https://askubuntu.com/a/254846
Update
I ran the above command here is the result.

associd=0 status=0614 leap_none, sync_ntp, 1 event, freq_mode,
system peer:        193.27.208.100:123
system peer mode:   client
leap indicator:     00
stratum:            2
log2 precision:     -24
root delay:         70.729
root dispersion:    94.791
reference ID:       193.27.208.100
reference time:     de77aaf3.34d27925  Wed, Apr 11 2018  0:28:19.206
system jitter:      16.162191
clock jitter:       23.622
clock wander:       0.000
broadcast delay:    0.000
symm. auth. delay:  0.000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
+46.227.200.70   90.155.74.41     3 u   17   64    3   53.618  -15.397  31.920
#178.17.160.12   162.23.41.56     2 u   19   64    3    1.972   37.222  33.064
+54.36.60.132    131.188.3.223    2 u   21   64    3   45.507  -14.822  34.655
#178.17.162.12   162.23.41.56     2 u   10   64    1   12.277   67.038   5.234
#2a00:1dc0::12   162.23.41.56     2 u   17   64    3    2.260   37.170  31.020
+78.46.93.106    126.94.231.148   2 u   14   64    3   38.454  -16.909  29.981
+193.6.222.47    121.131.112.137  2 u   12   64    3   14.051  -17.925  29.071
 217.26.163.51   129.6.15.30      2 u   17   64    3   33.883  -32.016  32.788
#2a00:1dc0:2::12 162.23.41.56     2 u   13   64    3    2.038   62.084  30.355
+51.174.131.248  129.242.4.241    2 u   13   64    3   56.278  -16.001  29.010
#178.17.161.12   162.23.41.56     2 u   18   64    3    1.893   29.749  33.004
#2a00:1dc0:1::12 162.23.41.56     2 u   15   64    3    2.032   29.540  31.144
+195.50.171.101  145.253.3.52     2 u   14   64    3   36.770  -16.139  29.035
+195.201.19.162  46.177.190.229   3 u   14   64    3   36.884  -15.937  28.842
*193.27.208.100  .PPS.            1 u   11   64    3   70.729   -4.192  26.977
+94.247.111.10   46.254.241.74    2 u   12   64    3  118.740  -10.962  26.170
+198.60.22.240   150.143.81.69    2 u   12   64    3  166.668  -17.647  29.310
Last edited 6 months ago by Dbryrtfbcbhgf (previous) (diff)

comment:13 Changed 5 months ago by catalyst

Summary: I keep getting this error on my relayEARLY_CONSENSUS_NOTICE_SKEW of 60 is too strict for some drifting dirauth clocks

Update ticket summary to better reflect the actual problem.

comment:14 Changed 5 months ago by catalyst

Keywords: 034-roadmap-proposed added

During this week's meeting, we decided it would be a good idea to relax this test to account for the voting schedule. That way for a client or relay to get a "consensus is coming from the future" warning, enough dirauths would have to have their clocks skewed by about the same amount. A single dirauth with an early clock shouldn't be able to induce this warning by releasing a consensus early. (This is what happened with dizum.)

arma says the voting-delay consensus parameter is something we want to look at if we want to not hard code this.

comment:15 Changed 5 months ago by nickm

Milestone: Tor: unspecifiedTor: 0.3.4.x-final
Owner: set to catalyst
Status: newassigned

comment:16 Changed 5 months ago by catalyst

https://github.com/tlyu/tor/tree/bug25756 contains some WIP preparatory refactoring and new tests.

comment:17 Changed 5 months ago by catalyst

From dir-spec.txt:

   VA-DistSeconds-VoteSeconds: The authorities exchange votes.

   VA-DistSeconds-VoteSeconds/2: The authorities try to download any
   votes they don't have.

   VA-DistSeconds: The authorities calculate the consensus and exchange
   signatures.

   VA-DistSeconds/2: The authorities try to download any signatures
   they don't have.

   VA: All authorities have a multiply signed consensus.

So VA-DistSeconds is the earliest that a dirauth with a skewed clock could possibly release a valid consensus, assuming that the other dirauths don't have skewed clocks. We probably shouldn't decide that a consensus is "early" before that time. (We probably should add some tolerance as well.)

If multiple dirauths have skewed clocks the story gets more complicated. Maybe it'll be rare enough to avoid thinking about for now?

comment:18 Changed 5 months ago by nickm

If multiple dirauths have skewed clocks the story gets more complicated. Maybe it'll be rare enough to avoid thinking about for now?

I'm fine with that. Certainly, if a _majority_ of authorities have skewed clocks, then our "believe the authorities" model has fallen apart.

comment:19 Changed 5 months ago by catalyst

Version: Tor: 0.3.2.10Tor: 0.2.2.25-alpha

comment:20 Changed 5 months ago by catalyst

Status: assignedneeds_review

Patches at https://github.com/torproject/tor/pull/84

The Travis failure is the existing Rust distcheck issue.

comment:21 Changed 5 months ago by dgoulet

Reviewer: isis

comment:22 in reply to:  20 Changed 5 months ago by isis

Status: needs_reviewmerge_ready

Replying to catalyst:

Patches at https://github.com/torproject/tor/pull/84

The Travis failure is the existing Rust distcheck issue.


LGTM!

comment:23 Changed 5 months ago by nickm

Resolution: fixed
Status: merge_readyclosed

Merging!

Note: See TracTickets for help on using tickets.