EARLY_CONSENSUS_NOTICE_SKEW of 60 is too strict for some drifting dirauth clocks

changed milestone to %Tor: 0.3.4.x-final

added 034-roadmap-proposed clock-skew component::core tor/tor milestone::Tor: 0.3.4.x-final owner::catalyst priority::medium reporter::Dbryrtfbcbhgf resolution::fixed reviewer::isis s8-errors severity::normal sponsor::8-can status::closed type::defect version::tor 0.2.2.25-alpha labels

Here is the relay: https://metrics.torproject.org/rs.html#details/5D86AFD7CE409251E67B373B4F0E780A0F41C944

I also made sure the system time was synced: Synchronized to time server 91.189.91.157:123 (ntp.ubuntu.com).

Trac:
Username: Dbryrtfbcbhgf

The clock skew tolerance should probably be larger than 61 seconds so this might be an actual bug. I think we could use a little more information though. Is the skew the same amount each time you get that warning, or does it change?

What output do you get from ntpq -n -c sysinfo -c peers?

Trac:
Status: new to needs_information
Milestone: N/A to Tor: unspecified
Keywords: N/A deleted, clock-skew, s8-errors added
Version: N/A to Tor: 0.3.2.10
Sponsor: N/A to Sponsor8-can

See also this thread on tor-relays: https://lists.torproject.org/pipermail/tor-relays/2018-February/014593.html

Do we have a log message around there that says where we got the consensus from?

Since this is a dir mirror, we got it from an authority, right?

There is a tiny part of me that wonders if this is because dizum's clock is 65 seconds early.

But...it can't be just that, right? Since this relay is seeing a consensus that was made in the future, and that means this relay's clock is set far enough in the past that all the dir auths made a consensus and timestamped it and made it available yet it was still in the future from this relay's perspective.

Ok, here's our hint: both this ticket and the tor-relays thread had this happen at 59 minutes after the hour.

Scenario: dizum is 65 seconds early. So it votes early, and sends a signature early, and most importantly, it makes the new consensus available early.

So if the relay here has the correct time, and it happens to ask for a consensus at the 59 minute mark, then dizum has already switched over to handing out the new consensus, and most importantly, it sticks a timestamp on the new consensus that says it came from the top of the hour. And while all the other dir auths have accurate clocks, they send their signatures early (5 minutes early) for robustness. So it doesn't matter whether they have accurate clocks: dizum can by itself produce a consensus with a timestamp in the future that has their signatures on it, and it can theoretically do this as soon as it has enough signatures from dir auths for that round, i.e. up to 5 minutes early if its clock drifted more.
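To make the scenario above concrete, here is a rough Python model of the timing. It is a sketch, not Tor's code: the function name `looks_early` and the exact comparison are assumptions about how a 60-second EARLY_CONSENSUS_NOTICE_SKEW check behaves, and the dates are made up. The 65-second offset is the dizum figure mentioned above.

```python
from datetime import datetime, timedelta

# Assumed model of the check: warn when the consensus valid-after time
# is more than EARLY_CONSENSUS_NOTICE_SKEW seconds ahead of our clock.
EARLY_CONSENSUS_NOTICE_SKEW = 60  # seconds

def looks_early(now, valid_after, skew=EARLY_CONSENSUS_NOTICE_SKEW):
    """Return True if valid_after is more than `skew` seconds after `now`."""
    return (valid_after - now).total_seconds() > skew

valid_after = datetime(2018, 4, 11, 1, 0, 0)    # top of the hour
authority_early_by = timedelta(seconds=65)       # dizum's clock offset
# True time at which the early authority starts serving the new consensus.
first_served = valid_after - authority_early_by  # 00:58:55

# A relay with a correct clock fetching just after the switch-over sees a
# valid-after more than 60 seconds in its future; a later fetch does not.
fetch_at_58_57 = first_served + timedelta(seconds=2)
fetch_at_59_30 = datetime(2018, 4, 11, 0, 59, 30)

warns_at_58_57 = looks_early(fetch_at_58_57, valid_after)  # 63 s ahead
warns_at_59_30 = looks_early(fetch_at_59_30, valid_after)  # 30 s ahead
```

Under these assumed numbers, only a fetch in the few seconds right after the early authority switches over trips the check; a larger clock offset on the authority would widen that window.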

Ok, here's a plan:

1. I'm going to file a consensus-health ticket and a doctor ticket, to have us check the dir auths for clock skew, to get earlier notice when they go wrong.
2. I'm going to contact dizum's operator and get him to fix it one more time.
3. I continue to think we should change the relay consensus fetching algorithm to wait a little while when it rolls the dice and they come up between :55 and :00. Dgoulet says he has a diagram, in his little book, of when each role fetches the consensus. We should get that transcribed into dir-spec, and then build a plan for this third item.
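The third item could look roughly like the Python sketch below. Everything in it is hypothetical: `pick_fetch_offset`, the `hold_off` value, and the idea of pushing the fetch into the next hour are illustrations of "wait a little while", not the actual relay fetch schedule, which still needs to be written down in dir-spec.

```python
import random

HOUR = 3600
AVOID_START = 55 * 60  # :55, start of the window to avoid

def pick_fetch_offset(rng, hold_off=120):
    """Pick a fetch time in seconds past the hour, avoiding :55 to :00.

    `hold_off` is a hypothetical extra wait after the top of the hour.
    A result above 3600 means "this many seconds past the start of the
    current hour", i.e. the fetch lands in the next hour.
    """
    offset = rng.randrange(HOUR)  # the "dice roll"
    if offset >= AVOID_START:
        # Defer past the top of the hour instead of fetching while an
        # authority with a fast clock may already serve the new consensus.
        offset = HOUR + hold_off + (offset - AVOID_START)
    return offset

rng = random.Random(0)
offsets = [pick_fetch_offset(rng) for _ in range(10000)]
```

Keeping the original position inside the five-minute window when deferring preserves the spread of fetch times, so deferred relays do not all arrive at the same second.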

Replying to arma:

I'm going to file a consensus-health ticket and a doctor ticket

Done: #25767 (moved) and #25768 (moved).

I'm going to contact dizum's operator and get him to fix it one more time.

Also done (well, initiated).

I continue to think we should change the relay consensus fetching algorithm to wait a little while when it rolls the dice and they come up between :55 and :00. Dgoulet says he has a diagram, in his little book, of when each role fetches the consensus. We should get that transcribed into dir-spec, and then build a plan for this third item.

Pending on dgoulet to send us the pics.

Replying to catalyst:

The clock skew tolerance should probably be larger than 61 seconds so this might be an actual bug. I think we could use a little more information though. Is the skew the same amount each time you get that warning, or does it change?

What output do you get from ntpq -n -c sysinfo -c peers?

associd=0 status=c016 leap_alarm, sync_unspec, 1 event, restart,
system peer:        0.0.0.0:0
system peer mode:   unspec
leap indicator:     11
stratum:            16
log2 precision:     -24
root delay:         0.000
root dispersion:    0.045
reference ID:       INIT
reference time:     00000000.00000000  Thu, Feb  7 2036  8:28:16.000
system jitter:      0.000000
clock jitter:       0.000
clock wander:       0.000
broadcast delay:    0.000
symm. auth. delay:  0.000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000

Trac:
Username: Dbryrtfbcbhgf

Here are all my logs on the relay. https://trac.torproject.org/projects/tor/attachment/ticket/25756/all%20logs.zip

Trac:
Username: Dbryrtfbcbhgf

all_logs.zip

Trac:
Username: Dbryrtfbcbhgf

Replying to Dbryrtfbcbhgf:

Replying to catalyst:

What output do you get from ntpq -n -c sysinfo -c peers?

{{{
associd=0 status=c016 leap_alarm, sync_unspec, 1 event, restart,
system peer:        0.0.0.0:0
system peer mode:   unspec
}}}

Thanks. I'm pretty sure the above means the ntpd is not synchronized; possibly it has just restarted?

{{{
leap indicator:     11
stratum:            16
log2 precision:     -24
root delay:         0.000
root dispersion:    0.045
reference ID:       INIT
reference time:     00000000.00000000  Thu, Feb  7 2036  8:28:16.000
system jitter:      0.000000
clock jitter:       0.000
clock wander:       0.000
broadcast delay:    0.000
symm. auth. delay:  0.000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
}}}

I think the other output is also consistent with the ntpd having just restarted.
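For anyone who wants to check this mechanically, the reading of the sysinfo output can be sketched in Python as below. The helper name `ntp_sysinfo_synced` is made up for illustration, and it is only a heuristic over the text format shown in this ticket: in NTP, stratum 16 and leap indicator 11 both mean the clock is unsynchronized.

```python
def ntp_sysinfo_synced(sysinfo_text):
    """Heuristic: True if `ntpq -c sysinfo` text looks synchronized."""
    fields = {}
    for line in sysinfo_text.splitlines():
        # Split "key:   value" at the first colon; skip lines without one.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    # Treat missing fields as unsynchronized.
    stratum = int(fields.get("stratum", "16"))
    leap = fields.get("leap indicator", "11")
    return stratum < 16 and leap != "11"

# Shortened samples in the same format as the output pasted in this ticket.
just_restarted = "leap indicator:     11\nstratum:            16\nreference ID:       INIT\n"
after_sync = "system peer:        193.27.208.100:123\nleap indicator:     00\nstratum:            2\n"

restarted_ok = ntp_sysinfo_synced(just_restarted)
synced_ok = ntp_sysinfo_synced(after_sync)
```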

Anyway, as arma commented above, the more likely problem seems to be a dirauth having an inaccurate clock.

Trac:
Status: needs_information to new

Replying to catalyst:

Replying to Dbryrtfbcbhgf:

Replying to catalyst:

What output do you get from ntpq -n -c sysinfo -c peers?

Thanks. I'm pretty sure the above means the ntpd is not synchronized; possibly it has just restarted?

I think the other output is also consistent with the ntpd having just restarted.

Anyway, as arma commented above, the more likely problem seems to be a dirauth having an inaccurate clock.

Is there anything I should do on my end? Should I manually sync ntpd, like this?

sudo service ntp stop
sudo ntpd -gq
sudo service ntp start

Source: https://askubuntu.com/a/254846

Update: I ran the above commands; here is the result.

associd=0 status=0614 leap_none, sync_ntp, 1 event, freq_mode,
system peer:        193.27.208.100:123
system peer mode:   client
leap indicator:     00
stratum:            2
log2 precision:     -24
root delay:         70.729
root dispersion:    94.791
reference ID:       193.27.208.100
reference time:     de77aaf3.34d27925  Wed, Apr 11 2018  0:28:19.206
system jitter:      16.162191
clock jitter:       23.622
clock wander:       0.000
broadcast delay:    0.000
symm. auth. delay:  0.000
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000
 ntp.ubuntu.com  .POOL.          16 p    -   64    0    0.000    0.000   0.000
+46.227.200.70   90.155.74.41     3 u   17   64    3   53.618  -15.397  31.920
#178.17.160.12   162.23.41.56     2 u   19   64    3    1.972   37.222  33.064
+54.36.60.132    131.188.3.223    2 u   21   64    3   45.507  -14.822  34.655
#178.17.162.12   162.23.41.56     2 u   10   64    1   12.277   67.038   5.234
#2a00:1dc0::12   162.23.41.56     2 u   17   64    3    2.260   37.170  31.020
+78.46.93.106    126.94.231.148   2 u   14   64    3   38.454  -16.909  29.981
+193.6.222.47    121.131.112.137  2 u   12   64    3   14.051  -17.925  29.071
 217.26.163.51   129.6.15.30      2 u   17   64    3   33.883  -32.016  32.788
#2a00:1dc0:2::12 162.23.41.56     2 u   13   64    3    2.038   62.084  30.355
+51.174.131.248  129.242.4.241    2 u   13   64    3   56.278  -16.001  29.010
#178.17.161.12   162.23.41.56     2 u   18   64    3    1.893   29.749  33.004
#2a00:1dc0:1::12 162.23.41.56     2 u   15   64    3    2.032   29.540  31.144
+195.50.171.101  145.253.3.52     2 u   14   64    3   36.770  -16.139  29.035
+195.201.19.162  46.177.190.229   3 u   14   64    3   36.884  -15.937  28.842
*193.27.208.100  .PPS.            1 u   11   64    3   70.729   -4.192  26.977
+94.247.111.10   46.254.241.74    2 u   12   64    3  118.740  -10.962  26.170
+198.60.22.240   150.143.81.69    2 u   12   64    3  166.668  -17.647  29.310

Trac:
Username: Dbryrtfbcbhgf

Update ticket summary to better reflect the actual problem.

Trac:
Summary: I keep getting this error on my relay to EARLY_CONSENSUS_NOTICE_SKEW of 60 is too strict for some drifting dirauth clocks

During this week's meeting, we decided it would be a good idea to relax this test to account for the voting schedule. That way for a client or relay to get a "consensus is coming from the future" warning, enough dirauths would have to have their clocks skewed by about the same amount. A single dirauth with an early clock shouldn't be able to induce this warning by releasing a consensus early. (This is what happened with dizum.)

arma says the voting-delay consensus parameter is something we want to look at if we want to not hard code this.
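One way the relaxed test might look, as a Python sketch. This is an assumption about the shape of the fix, not the patch itself: `looks_early_relaxed` and `dist_delay` are illustrative names, and the sketch assumes the signature-distribution interval taken from the consensus's voting-delay values is the right amount to add. The 300 seconds used below is the "5 minutes early" figure from earlier in this ticket.

```python
EARLY_CONSENSUS_NOTICE_SKEW = 60  # seconds, the current fixed tolerance

def looks_early_relaxed(now, valid_after, dist_delay,
                        skew=EARLY_CONSENSUS_NOTICE_SKEW):
    """Return True only if `now` is earlier than a consensus could exist.

    All times are Unix seconds. `dist_delay` is the assumed interval,
    in seconds, during which authorities already hold enough signatures
    before valid_after.
    """
    return now < valid_after - dist_delay - skew

valid_after = 1523408400  # an arbitrary top-of-the-hour timestamp
dist_delay = 300          # assumed 5-minute signature distribution period

# A consensus served 65 seconds early by one fast authority: no warning.
single_early_auth = looks_early_relaxed(valid_after - 65, valid_after, dist_delay)
# A clock 400 seconds behind valid-after: still worth a warning.
badly_skewed = looks_early_relaxed(valid_after - 400, valid_after, dist_delay)
```

With this shape, a single authority releasing the consensus as soon as it has signatures cannot trigger the warning on a relay whose clock is correct.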

Trac:
Keywords: N/A deleted, 034-roadmap-proposed added

Trac:
Status: new to assigned
Milestone: Tor: unspecified to Tor: 0.3.4.x-final
Owner: N/A to catalyst

https://github.com/tlyu/tor/tree/bug25756 contains some WIP preparatory refactoring and new tests.
