Opened 22 months ago

Closed 7 weeks ago

#20534 closed enhancement (wontfix)

Revise hard-coded download schedules

Reported by: teor
Owned by:
Priority: Low
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.2.9.1-alpha
Severity: Normal
Keywords: regression, triaged-out-20170124, 034-triage-20180328, 034-removed-20180328
Cc: catalyst
Actual Points:
Parent ID:
Points: 0.5
Reviewer:
Sponsor:

Description

We should tweak the download schedules in config.c based on what we've learned in #20499.

These schedules should retry sooner than never:
TestingServerDownloadSchedule
TestingClientDownloadSchedule

This schedule retries at most every 2 hours; should that be higher?
TestingServerConsensusDownloadSchedule

This schedule retries at most every 12 hours; should that be higher? Lower?
TestingClientConsensusDownloadSchedule

These schedules retry at most every 73 hours; should that be lower?
Should we try more times before jumping to retrying after an hour?
ClientBootstrapConsensusAuthorityDownloadSchedule
ClientBootstrapConsensusFallbackDownloadSchedule
ClientBootstrapConsensusAuthorityOnlyDownloadSchedule

Should we try more than 7 or 8 times to get directory documents?
TestingConsensusMaxDownloadTries
ClientBootstrapConsensusMaxDownloadTries
TestingDescriptorMaxDownloadTries
TestingMicrodescMaxDownloadTries
TestingCertMaxDownloadTries

Child Tickets

Ticket   Type         Status  Owner   Summary
#20499   defect       closed  nickm   A running Tor won't update the microdesc consensus
#20604   enhancement  closed          Allow exponential backoff to configure a once-off initial delay
#20605   enhancement  closed          Reduce the exponential backoff variance
#20606   enhancement  closed          Make the test network exponential backoff multiplier configurable
#20607   enhancement  closed          Revise chutney download schedules for exponential backoff

Attachments (1)

Exponential Backoff.xlsx (33.3 KB) - added by teor 21 months ago.
Spreadsheet used to calculate averages and ranges (sorry, Excel)


Change History (37)

comment:1 Changed 22 months ago by teor

Parent ID: #20499

comment:2 Changed 22 months ago by teor

Some of these schedules also affect #19969

comment:3 Changed 22 months ago by teor

Actually, there's not much point in revising these schedules - the exponential backoff code only pays attention to the first and last value in the schedule.

comment:4 in reply to:  3 ; Changed 22 months ago by dgoulet

Status: new → needs_information

Replying to teor:

Actually, there's not much point in revising these schedules - the exponential backoff code only pays attention to the first and last value in the schedule.

teor, should we keep this in 029 then? Or the ticket at all?

comment:5 in reply to:  4 Changed 22 months ago by teor

Replying to dgoulet:

Replying to teor:

Actually, there's not much point in revising these schedules - the exponential backoff code only pays attention to the first and last value in the schedule.

teor, should we keep this in 029 then? Or the ticket at all?

If we revert the exponential backoff code as our solution to #20499, we will want to tweak the final times on some of these schedules, to achieve Roger's goal of "never have a relay stop trying entirely".

If we don't, we will want to adjust the initial time as well.

So I think my comment was a bit hasty - it's only the middle times that don't matter, and only if we keep exponential backoff.

comment:6 Changed 22 months ago by nickm

Status: needs_information → needs_review

I think that the answer here is just to remove the final times entirely, or make them not count when the schedule is exponential. I've done this as part of my branch 20499_part1_029.

comment:7 Changed 21 months ago by teor

I've reviewed 20499_part1_029 over in #20499.

comment:8 Changed 21 months ago by teor

Here are the schedules we used to use before exponential backoff was implemented:

TestingServerDownloadSchedule "0, 0, 0, 60, 60, 120, 300, 900, 2147483647"
TestingClientDownloadSchedule "0, 0, 60, 300, 600, 2147483647"
TestingServerConsensusDownloadSchedule "0, 0, 60, 300, 600, 1800, 1800, 1800, 1800, 1800, 3600, 7200"
TestingClientConsensusDownloadSchedule "0, 0, 60, 300, 600, 1800, 3600, 3600, 3600, 10800, 21600, 43200"
ClientBootstrapConsensusFallbackDownloadSchedule "0, 1, 4, 11, 3600, 10800, 25200, 54000, 111600, 262800"
ClientBootstrapConsensusAuthorityOnlyDownloadSchedule "0, 3, 7, 3600, 10800, 25200, 54000, 111600, 262800"
ClientBootstrapConsensusAuthorityDownloadSchedule "10, 11, 3600, 10800, 25200, 54000, 111600, 262800"
TestingBridgeDownloadSchedule "3600, 900, 900, 3600"

And here are the average exponential backoff attempt times for each unique starting point above:

0, 1, 2, 3.5, 5.5, 8.5, 13, 19.5, 29, 43.5, 65.5, ...
10, 15, 22.5, 34, 51, 76.5, ...
3600, 5400, 8100, ...

This means:

  • In 0.2.9, almost every tor instance will try to download almost every document 11 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
  • In 0.2.9, clients will try authorities 5 times in the first minute. In 0.2.8, this was 2 times in the first minute.
  • In 0.2.9, bridge clients will try to download bridge descriptors 2 times in the first 3 hours. In 0.2.8, this was 4 times in the first 3 hours.

We can't fix this by modifying the minimum times. But we might be able to fix it by modifying the exponent. Or providing a failure count at which we increase the delay to hourly, rather than slowly increasing it in an exponential fashion.

Here's what it would look like if the exponent were 3 (max = 5*delay) rather than 1.5 (max = 2*delay), and we adjusted the client to authority and bridge descriptor start times:

0, 1, 3, 9, 27, 81, ...
6, 18, 54, ...
1200, 3600, 10800, ...

This would mean:

  • In 0.2.9, almost every tor instance would try to download almost every document 5 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
  • In 0.2.9, clients would try authorities 3 times in the first minute (and twice in the first 24 seconds). In 0.2.8, this was 2 times in the first minute (and twice in the first 21 seconds).
  • In 0.2.9, bridge clients will try to download bridge descriptors 3 times in the first 3 hours (the first time after 20 minutes). In 0.2.8, this was 4 times in the first 3 hours (the first time after 1 hour).
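The exponent-3 sequences above follow a simple rule: each attempt happens, on average, a fixed multiple later than the previous one. A short sketch can regenerate them; `average_attempt_times` is a hypothetical helper for illustration, not Tor's code:

```python
def average_attempt_times(initial_delay, multiplier, attempts):
    """Average time of each download attempt, assuming each attempt
    happens `multiplier` times later than the previous one, and that
    an initial delay of 0 is followed by an attempt at t = 1.
    A hypothetical helper for illustration, not Tor's implementation."""
    times, t = [], initial_delay
    for _ in range(attempts):
        times.append(t)
        t = t * multiplier if t > 0 else 1
    return times

print(average_attempt_times(0, 3, 6))     # [0, 1, 3, 9, 27, 81]
print(average_attempt_times(6, 3, 3))     # [6, 18, 54]
print(average_attempt_times(1200, 3, 3))  # [1200, 3600, 10800]
```

With multiplier 3 and the three starting points above, this reproduces the exponent-3 sequences; the exponent-1.5 sequences match only approximately because of integer rounding in the real code.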

comment:9 Changed 21 months ago by teor

Keywords: CoreTorTeam201611 added
Status: needs_review → new
Version: Tor: 0.2.9.1-alpha

So, in summary, here are the 3 changes I think resolve #20499, by making the exponential backoff schedules much more like the 0.2.8 schedules:

  • adjust max_increment to be delay*5, which makes the exponent (1+5)/2 = 3
  • make the first entry in ClientBootstrapConsensusAuthorityDownloadSchedule 6
  • make the first entry in TestingBridgeDownloadSchedule 1200
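The arithmetic behind "max_increment = delay*5 makes the exponent (1+5)/2 = 3" can be checked numerically. The sketch below assumes the next delay is drawn uniformly from [delay, 5*delay]; that uniform-draw model is an assumption for illustration, not a copy of Tor's backoff code:

```python
import random

# Monte Carlo check: if the next delay is drawn uniformly from
# [delay, 5 * delay], the delay grows by (1 + 5) / 2 = 3x on average.
random.seed(1)
delay = 60.0
draws = [random.uniform(delay, 5 * delay) for _ in range(200_000)]
avg_multiplier = sum(draws) / len(draws) / delay
print(f"average growth factor: {avg_multiplier:.2f}")  # ~3.00
```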

comment:10 Changed 21 months ago by nickm

Okay, I've taken a look here in my branch bug20534_029. Is that what you had in mind? I've not been able to convince myself 5x is safe, so I went with 4x. Let's see if that works out.

comment:11 Changed 21 months ago by nickm

Status: new → needs_review

comment:12 in reply to:  10 Changed 21 months ago by teor

Status: needs_review → merge_ready

Replying to nickm:

Okay, I've taken a look here in my branch bug20534_029. Is that what you had in mind? I've not been able to convince myself 5x is safe, so I went with 4x. Let's see if that works out.

That's fine, I was concerned 5x would lead to too much variance.

One nitpick: "no more than quadruple" is wrong; it should be "no more than quintuple", because the increment is added to the existing delay.

Let me just do my sums for exponent 2.5 (max = 4*delay):

0, 1, 2.5, 6.3, 15.6, 39.1, 97.7, ...
6, 15, 37.5, 93.75, ...
1200, 3000, 7500, 18750, ...

This would mean:

  • In 0.2.9, almost every tor instance would try to download almost every document 6 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
  • In 0.2.9, clients would try authorities 3 times in the first minute (and twice in the first 21 seconds). In 0.2.8, this was 2 times in the first minute (and twice in the first 21 seconds).
  • In 0.2.9, bridge clients will try to download bridge descriptors 3 times in the first 3 hours (the first time after 20 minutes). In 0.2.8, this was 4 times in the first 3 hours (the first time after 1 hour).

I think we should document this somewhere, but that should be a separate task.
(I'm sure arma would like to read this analysis, eventually.)

comment:13 Changed 21 months ago by nickm

I'll fix and merge then, and leave the ticket open for a documentation spree.

comment:14 Changed 21 months ago by teor

Here's a relevant comment from the 0.2.8 #4483 implementation in config.c. We can contrast it with the 0.2.9 #15942 implementation:

 /* With the ClientBootstrapConsensus*Download* below:
   * Clients with only authorities will try:
   *  - 3 authorities over 10 seconds, then wait 60 minutes.
   * Clients with authorities and fallbacks will try:
   *  - 2 authorities and 4 fallbacks over 21 seconds, then wait 60 minutes.
   * Clients will also retry when an application request arrives.
   * After a number of failed requests, clients retry every 3 days + 1 hour.
   *
   * Clients used to try 2 authorities over 10 seconds, then wait for
   * 60 minutes or an application request.
   *
   * When clients have authorities and fallbacks available, they use these
   * schedules: (we stagger the times to avoid thundering herds) */

comment:15 Changed 21 months ago by teor

Also, authority-only clients will try 6 authorities in the first minute. (5 in the first 30 seconds, 4 in the first 10 seconds). This isn't ideal, but it's also not the default, so I don't think it matters that much.

One way to fix this would be to make ClientBootstrapConsensusAuthorityOnlyDownloadSchedule start with 1 or 2 or 3, rather than 0. (What I really want is a way to say: "initial delay, then exponential on this other delay after that".)
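The "initial delay, then exponential on this other delay after that" idea could be expressed as a generator that yields a once-off initial delay, then backs off from an independent base. The helper name and the uniform draw below are hypothetical, not Tor's code:

```python
import random

def backoff_delays(initial_delay, base, max_factor, rng=random):
    """Hypothetical sketch: yield a once-off initial delay, then
    exponential backoff starting from `base`, with each new delay
    drawn uniformly from [delay, delay * max_factor].  An
    illustration of the idea, not Tor's implementation."""
    yield initial_delay
    delay = base
    while True:
        yield delay
        delay = rng.randint(delay, delay * max_factor)

# e.g. wait 6 seconds once, then back off exponentially from 1 second
gen = backoff_delays(6, 1, 3)
```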

comment:16 Changed 21 months ago by teor

Here's how the latest version works, with the next delay in [1, delay*5] (multiplier 4), except when delay is 0, when the next delay is in [1,2]. This time, I'm using ranges and averages:

0, 1.5 [1-2], 4.3 [2-10], 11 [3-50], 28 [4-250], 71 [5-1250], ...
6, 19 [7-30], 47 [8-150], 117 [9-750], 294 [10-3750], ...
1200, 3601 [1201-6000], 9002 [1202-30000], 22505 [1203-150000], ...

Ok, so that multiplier is going to be terrible (in rare cases) for clients, let's try delay*4 (multiplier 3):

0, 1.5 [1-2], 5 [2-8], 11 [3-32], 22 [4-128], 44 [5-256], 88 [6-512], ...
6, 16 [7-24], 32 [8-96], 64 [9-384], 128 [10-1536], ...
1200, 3001 [1201-4800], 6002 [1202-19200], 12004 [1203-76800], ...

And delay*3 (multiplier 2):

0, 1.5 [1-2], 4 [2-6], 6.5 [3-18], 10 [4-54], 16 [5-162], 24 [6-486], 37 [7-1458], 56 [8-4374], 85 [9-13122], ...
6, 13 [7-18], 19 [8-54], 29 [9-162], 45 [10-486], ...
1200, 2400 [1201-3600], 3601 [1202-10800], 5402 [1203-32400], ...

So a multiplier of 2 is too fast; 3 seems just about right.

comment:17 Changed 21 months ago by teor

I wanted a multiplier of 2.5, but we settled on 3, because, integers.

In 0.3.0 or 0.3.1, I think we should consider having the next delay in [delay*2, delay*3], rather than [delay, delay*3]. This would increase the average and lower bound, and decrease the upper bound (I don't like the range in the current model). The initial values might need some tweaking after this change.

comment:18 Changed 21 months ago by nickm

Milestone: Tor: 0.2.9.x-final → Tor: 0.3.0.x-final
Owner: set to nickm
Status: merge_ready → accepted

Merged, leaving open for documentation

comment:19 Changed 21 months ago by teor

Turns out that sometimes I just can't do my sums right.
Here are the actual figures for each case with exponent 4 (multiplier 3):

Initial delay 0: (most schedules)

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	0	0	0
2	1	2	2	1	2
3	1	8	5	2	10
4	1	32	13	3	42
5	1	128	33	4	170
6	1	512	84	5	682
7	1	2048	211	6	2730

Initial delay 6: (client bootstrap from authorities)

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	6	6	6
2	1	20	11	7	26
3	1	80	27	8	106
4	1	320	69	9	426
5	1	1280	174	10	1706

Initial delay 1200: (bridge client bridge descriptors)

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	1200	1200	1200
2	1	3602	1802	1201	4802
3	1	14408	4505	1202	19210
4	1	57632	11263	1203	76842

And for test networks, with exponent 3 (multiplier 2):
Initial delay 0: (most schedules)

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	0	0	0
2	1	2	2	1	2
3	1	6	4	2	8
4	1	18	9	3	26
5	1	54	19	4	80
6	1	162	39	5	242
7	1	486	79	6	728

Initial delay 20 (bridge client bridge descriptors):

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	20	20	20
2	1	42	42	21	62
3	1	126	84	22	188
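The tables above can be approximated with a small Monte Carlo simulation of the scheme from comment 16: each new increment drawn uniformly from [1, previous_increment * "exponent"], and from [1, 2] after a zero increment. This is a sketch for checking the averages and ranges, not Tor's implementation, and it only models the zero-initial-delay case exactly:

```python
import random

def simulate(initial_delay, max_factor, attempts, trials=40_000, seed=0):
    """Monte Carlo sketch of the backoff scheme described above: each
    new increment is drawn uniformly from [1, previous * max_factor]
    (from [1, 2] after a zero increment).  Returns (average, min, max)
    of the cumulative attempt time for each attempt.  An illustration
    only, not Tor's code."""
    rng = random.Random(seed)
    stats = [[] for _ in range(attempts)]
    for _ in range(trials):
        t = inc = initial_delay
        for i in range(attempts):
            stats[i].append(t)
            inc = rng.randint(1, 2) if inc == 0 else rng.randint(1, inc * max_factor)
            t += inc
    return [(sum(s) / len(s), min(s), max(s)) for s in stats]

# "Exponent" 4, initial delay 0 (most schedules); compare with the
# first table above: averages near 0, 2, 5, 13, 33 and maxima
# 0, 2, 10, 42, 170.
for avg, lo, hi in simulate(0, 4, 5):
    print(f"avg {avg:8.1f}   range [{lo}-{hi}]")
```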

Changed 21 months ago by teor

Attachment: Exponential Backoff.xlsx added

Spreadsheet used to calculate averages and ranges (sorry, Excel)

comment:20 Changed 21 months ago by teor

Parent ID: #20499 deleted

comment:21 Changed 19 months ago by nickm

Keywords: triaged-out-20170124 added
Milestone: Tor: 0.3.0.x-final → Tor: 0.3.1.x-final

comment:22 Changed 17 months ago by nickm

Priority: Medium → Low

Lower priority on some of my assigned tickets

comment:23 Changed 16 months ago by nickm

Keywords: 031-reach added

comment:24 Changed 15 months ago by catalyst

Cc: catalyst added

comment:25 Changed 15 months ago by nickm

Milestone: Tor: 0.3.1.x-final → Tor: 0.3.2.x-final

I'm not going to be able to get this right in 0.3.1

comment:26 Changed 15 months ago by nickm

Keywords: CoreTorTeam201611 removed

comment:27 Changed 15 months ago by nickm

Keywords: 031-reach removed

comment:28 Changed 11 months ago by nickm

What's our status here? Should we merge something else in 0.3.2, or close this and document it, or some other thing?

comment:29 in reply to:  18 Changed 11 months ago by teor

Replying to nickm:

Merged, leaving open for documentation

We should document and close this.
Given that things are working, the other child tickets aren't a high priority.

comment:30 Changed 11 months ago by nickm

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final
Owner: nickm deleted
Status: accepted → assigned

Okay; then I'll unassign from myself (I can't write this documentation) and defer to 0.3.3.

comment:31 Changed 11 months ago by nickm

Status: assigned → new

comment:32 Changed 6 months ago by nickm

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final
Type: defect → enhancement

Label a bunch of (arguable and definite) enhancements as enhancements for 0.3.4.

comment:33 Changed 5 months ago by nickm

Keywords: 034-triage-20180328 added

comment:34 Changed 5 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:35 Changed 4 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:36 Changed 7 weeks ago by teor

Resolution: wontfix
Status: new → closed

These schedules use exponential backoff with decorrelated jitter, with no maximums. That seems to be working well enough.
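For reference, "decorrelated jitter" is usually given (e.g. in the AWS Architecture Blog write-up of backoff strategies) as drawing each sleep uniformly from [base, previous_sleep * 3]. The sketch below follows that standard form, with an optional cap since the schedules here have no maximums; the helper name is hypothetical and this is not Tor's code:

```python
import random

def decorrelated_jitter(base, cap=None, rng=random):
    """Sketch of decorrelated-jitter backoff: each sleep is drawn
    uniformly from [base, previous_sleep * 3], optionally capped.
    The standard published form, not Tor's exact variant."""
    sleep = base
    while True:
        yield sleep
        sleep = rng.uniform(base, sleep * 3)
        if cap is not None:
            sleep = min(sleep, cap)
```

Because each draw depends on the previous sleep rather than the attempt count, long sleeps decay back toward `base` instead of compounding forever, which spreads retries out without synchronizing clients.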
