Opened 7 years ago

Closed 3 years ago

Last modified 3 years ago

#8625 closed defect (implemented)

Do not call networkstatus_reset_download_failures() hourly

Reported by: nickm Owned by: andrea
Priority: High Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-client, tor-dos, TorCoreTeam201608, review-group-6
Cc: Actual Points: 3
Parent ID: Points: 3
Reviewer: nickm Sponsor: SponsorU-can

Description

Once an hour, we call router_reset_descriptor_download_failures, which calls networkstatus_reset_download_failures(). This probably isn't right! Our download logic has a way to say "Try this once an hour": that's by having a download schedule with a one-hour interval. What this does is make us retry several times an hour, as if we had never failed before.

Found while investigating #4580 .

Child Tickets

Change History (31)

comment:1 Changed 7 years ago by nickm

(I've put this in 0.2.5 with an 024-backport tag since it's likely to take a little tuning, and prone to break something unexpected.)

comment:2 Changed 6 years ago by nickm

Milestone: Tor: 0.2.5.x-finalTor: 0.2.6.x-final

comment:3 Changed 5 years ago by nickm

Keywords: 025-backport 026-triaged-1 026-deferrable added; 024-backport removed

comment:4 Changed 5 years ago by nickm

Milestone: Tor: 0.2.6.x-finalTor: 0.2.???

Defer some items from 0.2.6. They are mostly worth doing, but not going to happen super-fast.

comment:5 Changed 4 years ago by nickm

Milestone: Tor: 0.2.???Tor: 0.2.8.x-final

comment:6 Changed 4 years ago by nickm

Points: large

comment:7 Changed 4 years ago by nickm

Priority: MediumHigh
Severity: Normal

comment:8 Changed 4 years ago by teor

#4483 makes consensus downloads more reliable by opening more connections.
These extra connections make this issue more important to fix, and it also means we don't need to retry so often any more.
We should try to get this in 0.2.8, but I'm not sure I can do it.

comment:9 Changed 4 years ago by nickm

Priority: HighVery High

comment:10 Changed 4 years ago by nickm

Milestone: Tor: 0.2.8.x-finalTor: 0.2.9.x-final
Sponsor: SponsorU-can

comment:11 Changed 4 years ago by nickm

Priority: Very HighHigh

comment:12 Changed 4 years ago by nickm

Keywords: tor-dos added

comment:13 Changed 4 years ago by isabela

Points: large6

comment:14 Changed 3 years ago by nickm

Points: 6parent

comment:15 Changed 3 years ago by nickm

Points: parent3

Actually, this is easier than I'd thought; this function does less than I feared. I'm still leaving in an extra day for debug time though

comment:16 Changed 3 years ago by andrea

Owner: set to andrea
Status: newassigned

Taking ownership for 0.2.9 triage

comment:17 Changed 3 years ago by nickm

Keywords: TorCoreTeam201607 added

comment:18 Changed 3 years ago by andrea

This one turns out to be easy to just disable the callback in main.c, and the result bootstraps and works correctly when the network is operating normally, but testing in situations where some failures and scheduled retries actually happen is more challenging; I could parse a list of all (IP:dirport) pairs out of the consensus for directory mirrors and use iptables to make all direct download attempts fail, but this will be incomplete given the ability to tunnel directory requests once bootstrapped. Possibly this needs the introduction of a new config option to simulate failure of all download requests when enabled.

comment:19 Changed 3 years ago by andrea

The if (!options->FetchServerDescriptors) test in directory_get_from_dirserver() looks promising, but doesn't actually do the thing we need because it prevents new requests from being launched too early for download_status_increment_failure() to ever be called. I'm adding a new SimulateDirDownloadFailures config boolean for this.

comment:20 Changed 3 years ago by andrea

I think the right place to do this options test and simulated connection failure is in directory_send_command() right before the connection_write_to_buf() stuff starts; the download failure handling in connection_dir_about_to_close() et seq. depends on having the conn->requested_resource field set correctly.

comment:21 Changed 3 years ago by andrea

Status: assignedneeds_review

Proposed fix in my bug8625 branch; testing still in progress.

comment:22 Changed 3 years ago by andrea

Actual Points: 3

comment:23 Changed 3 years ago by nickm

Reviewer: nickm

comment:25 Changed 3 years ago by nickm

Actual-review-points-so-far: 0

At first glance looks okay, modulo small comments.

Testing will be important. The important thing to consider is whether there are any weird cases where we are assuming that failures get reset often and we aren't resetting them as often as we should. But if you don't find any of those in testing, I say we merge and wait to see if we get bugs reported? Even if there are no actual bugs here, I expect that doing this will force us to re-tune some of our retry parameters a little.

comment:26 Changed 3 years ago by nickm

Keywords: review-group-6 added

comment:27 Changed 3 years ago by nickm

Status: needs_reviewneeds_revision

comment:28 Changed 3 years ago by andrea

Status: needs_revisionneeds_review

See bug8625_prod branch, rebased and with SimulateDirDownloadFailures removed.

comment:29 Changed 3 years ago by nickm

Keywords: 025-backport 026-triaged-1 026-deferrable removed
Resolution: implemented
Status: needs_reviewclosed

lgtm, merged! Let's keep in mind that we merged this if any download-related bugs crop up.

comment:30 Changed 3 years ago by nickm

Keywords: TorCoreTeam201608 added; TorCoreTeam201607 removed

comment:31 Changed 3 years ago by arma

For posterity: #20499 and its children were the bugs that cropped up.

Note: See TracTickets for help on using tickets.