Opened 12 months ago

Closed 5 months ago

Last modified 5 months ago

#20909 closed defect (fixed)

Tor 0.2.9.5-alpha still delivers outdated consensuses

Reported by: teor
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.2.9.5-alpha
Severity: Normal
Keywords: tor-relay needs-analysis maybe-it-went-away-when-we-werent-looking
Cc:
Actual Points:
Parent ID:
Points: 1
Reviewer:
Sponsor:

Description

I found one relay on 0.2.9.5-alpha that still has this bug, after scanning ~400 relays.

WARNING::Consensus download: 1.8s outdated consensus, expired 44855s ago from snowfall (192.160.102.164:80) 823AA81E277F366505545522CEDC2F529CE4DC3F, max download time 15.0s.

https://atlas.torproject.org/#details/823AA81E277F366505545522CEDC2F529CE4DC3F

The microdesc consensus is ~12 hours out of date:
http://192.160.102.164/tor/status-vote/current/consensus-microdesc

The full consensus is fine:
http://192.160.102.164/tor/status-vote/current/consensus

It was fine yesterday when I scanned it.

Child Tickets

Change History (21)

comment:1 Changed 12 months ago by teor

I reopened #20501 to get another full network scan, and emailed the operator asking them for their logs.

comment:2 Changed 12 months ago by teor

This relay is another instance of this regression; it is on 0.2.9.6-rc:

WARNING::Consensus download: 2.6s outdated consensus, expired 1586s ago from torrelay1ph3xat (86.59.119.83:80) FC9AC8EA0160D88BCCFDE066940D7DD9FA45495B, max download time 15.0s.

The microdesc consensus is outdated, the full consensus is fine.
http://86.59.119.83/tor/status-vote/current/consensus-microdesc

I've emailed the operator asking for logs.

comment:3 Changed 12 months ago by teor

And one more, on 0.2.9.6-rc:

WARNING::Consensus download: 2.2s outdated consensus, expired 116s ago from NYCBUG0 (66.111.2.20:9030) 9A68B85A02318F4E7E87F2828039FBD5D75B0142, max download time 15.0s.

I emailed the operator.

I am wondering if some of our fixes in 0.3.0 solved this issue - particularly those in #20667. Or perhaps there just aren't that many relays on 0.3.0.

I am also wondering what happened on the network ~12 hours ago.
There weren't any yesterday, and now suddenly there are 3.
Do they reset themselves every day???

comment:4 Changed 12 months ago by teor

Here are my questions on this issue:

Are we delaying connections the way we expect?

I did some calculations for the expected exponential delay growth in #20534.
We should check that the delays are actually growing exponentially in practice.
We need some debug logging around the exponential backoff functions.

Is the exponential backoff maximum delay too high?

If the network goes down for a period of N seconds and then comes back up again, the backoff delay D will grow until D >= N. In fact, if each retry multiplies the delay by E, the final delay can overshoot N by up to a factor of E, so D can be of the order of E·N. Until it hits the maximum.
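
As a rough illustration, here is a minimal Python sketch of that behaviour; this is not Tor's code, and the initial delay, multiplier, and cap are assumed values chosen only to show the shape of the growth:

def downtime_after_outage(outage_secs, initial_delay=1, multiplier=4, max_delay=3 * 3600):
    # Simulate retries during an outage of outage_secs seconds and return
    # how long after the network recovers the next retry would happen.
    elapsed = 0
    delay = initial_delay
    while elapsed < outage_secs:
        elapsed += delay                            # this attempt fails; back off
        delay = min(delay * multiplier, max_delay)  # capped exponential growth
    return elapsed - outage_secs                    # extra downtime after recovery

for outage in (60, 3600, 12 * 3600, 24 * 3600):
    print('%6ds outage -> next retry %6ds after recovery'
          % (outage, downtime_after_outage(outage)))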

There's no network-driven reason for using INT_MAX as the maximum. I think we should define a maximum for each download schedule, which is a tradeoff between:

If my network connection goes down, how long should I stay down after it comes up again?

The answer for relay consensuses could be 3 hours, because after that they expire. (Or perhaps 24 hours, the "reasonably live" consensus period.)

Clients can afford to wait longer, as long as they come up when a request is made.
(But hidden services should come up automatically after a certain amount of time.)
Is 24 hours ok for a hidden service to be down?

If the Tor network is down or overloaded, how long do I need to wait to avoid making it worse?

This really depends on how many clients there are. (And how many relays. And how many fallback directories they try.)

I think 3 hours is too short. But maybe it's ok to have it just for consensuses?
Perhaps 24 hours is safe as a general limit?
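
To make that tradeoff concrete, a per-schedule maximum could look something like the following; the names and values are purely illustrative, not existing Tor options or consensus parameters:

# Hypothetical per-schedule backoff caps; illustrative only.
BACKOFF_MAX_DELAY = {
    'relay_consensus':  3 * 3600,   # relays: consensuses expire after 3 hours
    'client_consensus': 24 * 3600,  # clients: "reasonably live" consensuses last ~24 hours
    'default':          24 * 3600,  # suggested general limit instead of INT_MAX
}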

comment:5 Changed 12 months ago by nickm

I'm wondering if this is an intermittent failure or a semipermanent thing. If it's only intermittent, we don't need to add these to #20509. But if it's going to be that way all the time, we need to reconsider which versions #20509 hates.

comment:6 Changed 12 months ago by nickm

Priority: Medium → High

comment:7 Changed 12 months ago by teor

Those relays have recovered since yesterday.
That's good, but serving a stale consensus for ~12-24 hours is still a bad thing.

Running a full check in #20501 will tell us the prevalence of this issue.

Some of our 0.3.0 changes might fix this:

  • using microdesc consensuses by default (#6769).

I suggest we bring this change forward to 029 as an extra precaution:

  • 404 when consensus is out of date (#20511)

comment:8 Changed 12 months ago by teor

Today's crop:

0.2.9.6-rc

WARNING::Consensus download: 2.2s outdated consensus, expired 1794s ago from Quintex10 (199.249.223.71:80) B6320E44A230302C7BF9319E67597A9B87882241, max download time 15.0s.
WARNING::Consensus download: 2.6s outdated consensus, expired 952s ago from torrelay1ph3xat (86.59.119.83:80) FC9AC8EA0160D88BCCFDE066940D7DD9FA45495B, max download time 15.0s.

The oldest I've seen so far was ~13 hours; none seems to have made it past 24 hours yet.
And it's pretty rare, about 2-3% of the relays I'm testing.

comment:9 Changed 12 months ago by teor

(I will do another scan of ~200 relays for this over the weekend, when I create the draft fallback list in #18828.)

comment:10 Changed 12 months ago by teor

The harvest today: one new one and one repeat customer:

WARNING::Consensus download: 4.4s outdated consensus, expired 1120s ago from eriador (91.121.84.137:4951) 6DE61A6F72C1E5418A66BFED80DFB63E4C77668F, max download time 15.0s.
WARNING::Consensus download: 2.7s outdated consensus, expired 8323s ago from Quintex10 (199.249.223.71:80) B6320E44A230302C7BF9319E67597A9B87882241, max download time 15.0s.

All still seem to be recovering within 24 hours.
This still represents about 1% of relays scanned.

comment:11 Changed 12 months ago by teor

Keywords: regression must-fix-before-029-stable removed
Milestone: Tor: 0.2.9.x-final → Tor: 0.3.0.x-final
Priority: High → Medium

In #20501, atagar did a scan of 5053 directory mirrors with DirPorts (stem doesn't support ORPort begindir yet). Directory mirrors that did not respond appear to have been excluded.

Here are the results:

  • 41 (0.81%) serve an expired consensus
    • 3 (0.06%) are on 0.2.7 (0.2.7.6)
      • 2x2 days and 8 days after expiry
    • 2 (0.04%) are on 0.2.8 (0.2.8.9 and 0.2.8.0-alpha-dev)
      • 2x2 days after expiry
    • 36 (0.71%) are on 0.2.9
      • 29 (0.57%) are on <= 0.2.9.4-alpha-dev
        • 5 on 0.2.9.2-alpha
          • 2 days, 2 months, 3 months, and 2x4 months after expiry
        • 7 on 0.2.9.3-alpha
          • 3x2 days, 1 month, 3x2 months after expiry
        • 1 on 0.2.9.3-alpha-dev
          • 2 months after expiry
        • 16 on 0.2.9.4-alpha
          • 3x2 days, 2x1 week, 2x2 weeks, 9x1 month after expiry
      • 7 (0.14%) are on >= 0.2.9.5-alpha
        • 6 on 0.2.9.5-alpha
          • 2x1 day, 4x2 days after expiry
        • 1 on 0.2.9.6-rc
          • 1 day after expiry
    • none are on 0.3.0.0-alpha-dev
      • I wonder whether this bug is fixed there, or whether this version is too rare to register?

So we've largely fixed this bug in 0.2.9.5-alpha and later:

  • the expiry times are comparable to 0.2.7 and 0.2.8,
  • the prevalence of the bug is still somewhat higher than on 0.2.7 and 0.2.8, but much lower than on 0.2.9.0 through 0.2.9.4-alpha.

The current expiry times are workable for clients: they use consensuses up to a day old. (And will retry their other directory guards (3 in total) if the consensus they receive is old.)

So this is neither a regression (it happened in 0.2.8 and 0.2.7) nor a must-fix for 0.2.9.

Still, I think it would be nice to fix the schedule maximums in 0.3.0, based on:
https://trac.torproject.org/projects/tor/ticket/20909#comment:4

Moving to needs-infomation until we work out how old we want directory mirror consensuses to get, versus how much slow-zombie retry behaviour we want to tolerate from directory mirrors.

comment:12 Changed 12 months ago by nickm

Status: new → needs_information

    Moving to needs-infomation

Actually moving to needs-information ;)

comment:13 Changed 11 months ago by nickm

Milestone: Tor: 0.3.0.x-final → Tor: unspecified

comment:14 Changed 7 months ago by arma

Now that a bunch of time has passed, can somebody (teor or atagar or maybe you, the nice volunteer reading this) do another measurement run, to see how things stand here?

The stem script is attached to #20501.

(But since Tor 0.3.x relays have the #20511 feature, we will probably want to update the stem script to notice relays that give a 404 response, since maybe that means they have an expired consensus and are just choosing not to give it to us.)
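
As a hedged sketch of that kind of check (this is not the stem script from #20501; the relay address below is just the snowfall example from the ticket description), the following Python fetches the microdesc consensus from a relay's DirPort, treats a 404 as "consensus withheld, possibly expired" (the #20511 behaviour), and otherwise compares valid-until against the current time:

import urllib.request, urllib.error
from datetime import datetime

def check_consensus(address, dirport):
    url = 'http://%s:%d/tor/status-vote/current/consensus-microdesc' % (address, dirport)
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            body = response.read().decode('utf-8', 'replace')
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return 'no consensus served (404: possibly expired and withheld)'
        raise
    for line in body.splitlines():
        if line.startswith('valid-until '):
            valid_until = datetime.strptime(line.split(' ', 1)[1], '%Y-%m-%d %H:%M:%S')
            age = (datetime.utcnow() - valid_until).total_seconds()
            if age > 0:
                return 'consensus expired %ds ago' % age
            return 'consensus current'
    return 'no valid-until line found'

# Example: snowfall's DirPort from the ticket description.
print(check_consensus('192.160.102.164', 80))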

comment:15 in reply to: 14; Changed 7 months ago by teor

Replying to arma:

    Now that a bunch of time has passed, can somebody (teor or atagar or maybe you, the nice volunteer reading this) do another measurement run, to see how things stand here?

When I rebuilt the fallback list last week, there were no expired consensuses. I used to get several every time I ran the script.

https://trac.torproject.org/projects/tor/attachment/ticket/21564/fallbacks_2017-05-16-0815-09cd78886.log

    The stem script is attached to #20501.

    (But since Tor 0.3.x relays have the #20511 feature, we will probably want to update the stem script to notice relays that give a 404 response, since maybe that means they have an expired consensus and are just choosing not to give it to us.)

There were no 404 errors either.

Since the fallbacks are just a sample of the high-uptime, high-bandwidth relays in the network, we should check the entire network to be sure.

comment:16 in reply to: 15; Changed 7 months ago by arma

Replying to teor:

    Since the fallbacks are just a sample of the high-uptime, high-bandwidth relays in the network, we should check the entire network to be sure.

Agreed!

comment:17 Changed 5 months ago by nickm

Keywords: tor-relay needs-analysis maybe-it-went-away-when-we-werent-looking added

comment:18 Changed 5 months ago by teor

Atagar (or anyone else), did you want to re-run those checks, or should we close this?

comment:19 Changed 5 months ago by atagar

Hi teor, no problem. Kicking off a run of the script mentioned in the other ticket. Last time it took 639 minutes to run so I won't have results until tomorrow at the earliest.

comment:20 Changed 5 months ago by atagar

Hi teor, script finished after 511 minutes. Results are...

% grep -v currrent /tmp/results.txt        
2F8153E85A628D20416691082D58F95676719C14 (0.2.9.11): consensus expired at 2017-07-09T18:00:00
A3F68B3413BD4C83B7315B550AE84BABEB0F0CAF (0.2.9.10): consensus expired at 2017-06-26T12:00:00
7552AE46D6271D22B2EF0B12C96BA075FC0DC573 (0.2.7.6): consensus expired at 2017-07-10T08:00:00
5A044604030A3B0C45C5D1A96C8096026A9C1766 (0.2.5.14): failed (timed out)
4518985BA8BD859EE3C46D51CF07E9C14BD5BC17 (0.3.0.9): consensus expired at 2017-07-10T01:00:00
3FFCDAAC784C9FB66E502490A787B5B8AFC69E28 (0.3.1.4-alpha): consensus expired at 2017-07-10T14:00:00
06D0FB9C6860E8D7FB99EA310A016707AFAA71CE (0.2.9.11): consensus expired at 2017-07-06T08:00:00

comment:21 Changed 5 months ago by teor

Resolution: fixed
Status: needs_information → closed

Ok, that seems reasonable: 0.2.9, 0.3.0 and 0.3.1 seem no worse than 0.2.7 and 0.2.5, and the overall incidence is very low (~0.1% of relays).

It is unlikely any client would choose 3 of these relays as directory guards: with ~0.1% of relays affected, the chance is roughly 0.001^3, or ~1/10^9.
