Opened 22 months ago

Closed 7 weeks ago

#20534 closed enhancement (wontfix)

Revise hard-coded download schedules

Reported by: teor
Owned by:
Priority: Low
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.2.9.1-alpha
Severity: Normal
Keywords: regression, triaged-out-20170124, 034-triage-20180328, 034-removed-20180328
Cc: catalyst
Actual Points:
Parent ID:
Points: 0.5
Reviewer:
Sponsor:

Description

We should tweak the download schedules in config.c based on what we've learned in #20499.

These schedules should retry sooner than never:
TestingServerDownloadSchedule
TestingClientDownloadSchedule

This schedule retries at most every 2 hours; should that be higher?
TestingServerConsensusDownloadSchedule

This schedule retries at most every 12 hours; should that be higher? Lower?
TestingClientConsensusDownloadSchedule

These schedules retry at most every 73 hours; should that be lower?
Should we try more times before jumping to retrying after an hour?
ClientBootstrapConsensusAuthorityDownloadSchedule
ClientBootstrapConsensusFallbackDownloadSchedule
ClientBootstrapConsensusAuthorityOnlyDownloadSchedule

Should we try more than 7 or 8 times to get directory documents?
TestingConsensusMaxDownloadTries
ClientBootstrapConsensusMaxDownloadTries
TestingDescriptorMaxDownloadTries
TestingMicrodescMaxDownloadTries
TestingCertMaxDownloadTries

Child Tickets

Ticket   Type         Status  Owner   Summary
#20499   defect       closed  nickm   A running Tor won't update the microdesc consensus
#20604   enhancement  closed          Allow exponential backoff to configure a once-off initial delay
#20605   enhancement  closed          Reduce the exponential backoff variance
#20606   enhancement  closed          Make the test network exponential backoff multiplier configurable
#20607   enhancement  closed          Revise chutney download schedules for exponential backoff

Attachments (1)

Exponential Backoff.xlsx (33.3 KB) - added by teor 21 months ago.
Spreadsheet used to calculate averages and ranges (sorry, Excel)


Change History (37)

comment:1 Changed 22 months ago by teor

Parent ID: #20499

comment:2 Changed 22 months ago by teor

Some of these schedules also affect #19969

comment:3 Changed 22 months ago by teor

Actually, there's not much point in revising these schedules - the exponential backoff code only pays attention to the first and last value in the schedule.

comment:4 in reply to:  3 ; Changed 22 months ago by dgoulet

Status: new → needs_information

Replying to teor:

Actually, there's not much point in revising these schedules - the exponential backoff code only pays attention to the first and last value in the schedule.

teor, should we keep this in 029 then? Or the ticket at all?

comment:5 in reply to:  4 Changed 22 months ago by teor

Replying to dgoulet:

Replying to teor:

Actually, there's not much point in revising these schedules - the exponential backoff code only pays attention to the first and last value in the schedule.

teor, should we keep this in 029 then? Or the ticket at all?

If we revert the exponential backoff code as our solution to #20499, we will want to tweak the final times on some of these schedules, to achieve Roger's goal of "never have a relay stop trying entirely".

If we don't, we will want to adjust the initial time as well.

So I think my comment was a bit hasty - it's only the middle times that don't matter, and only if we keep exponential backoff.

comment:6 Changed 22 months ago by nickm

Status: needs_information → needs_review

I think that the answer here is just to remove the final times entirely, or make them not count when the schedule is exponential. I've done this as part of my branch 20499_part1_029.

comment:7 Changed 21 months ago by teor

I've reviewed 20499_part1_029 over in #20499.

comment:8 Changed 21 months ago by teor

Here are the schedules we used to use before exponential backoff was implemented:

TestingServerDownloadSchedule "0, 0, 0, 60, 60, 120, 300, 900, 2147483647"
TestingClientDownloadSchedule "0, 0, 60, 300, 600, 2147483647"
TestingServerConsensusDownloadSchedule "0, 0, 60, 300, 600, 1800, 1800, 1800, 1800, 1800, 3600, 7200"
TestingClientConsensusDownloadSchedule "0, 0, 60, 300, 600, 1800, 3600, 3600, 3600, 10800, 21600, 43200"
ClientBootstrapConsensusFallbackDownloadSchedule "0, 1, 4, 11, 3600, 10800, 25200, 54000, 111600, 262800"
ClientBootstrapConsensusAuthorityOnlyDownloadSchedule "0, 3, 7, 3600, 10800, 25200, 54000, 111600, 262800"
ClientBootstrapConsensusAuthorityDownloadSchedule "10, 11, 3600, 10800, 25200, 54000, 111600, 262800"
TestingBridgeDownloadSchedule "3600, 900, 900, 3600"

And here are the average exponential backoff attempt times for each unique starting point above:

0, 1, 2, 3.5, 5.5, 8.5, 13, 19.5, 29, 43.5, 65.5, ...
10, 15, 22.5, 34, 51, 76.5, ...
3600, 5400, 8100, ...

This means:

  • In 0.2.9, almost every tor instance will try to download almost every document 11 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
  • In 0.2.9, clients will try authorities 5 times in the first minute. In 0.2.8, this was 2 times in the first minute.
  • In 0.2.9, bridge clients will try to download bridge descriptors 2 times in the first 3 hours. In 0.2.8, this was 4 times in the first 3 hours.

We can't fix this by modifying the minimum times. But we might be able to fix it by modifying the exponent. Or providing a failure count at which we increase the delay to hourly, rather than slowly increasing it in an exponential fashion.

Here's what it would look like if the exponent were 3 (max = 5*delay) rather than 1.5 (max = 2*delay), and we adjusted the client to authority and bridge descriptor start times:

0, 1, 3, 9, 27, 81, ...
6, 18, 54, ...
1200, 3600, 10800, ...

This would mean:

  • In 0.2.9, almost every tor instance would try to download almost every document 5 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
  • In 0.2.9, clients would try authorities 3 times in the first minute (and twice in the first 24 seconds). In 0.2.8, this was 2 times in the first minute (and twice in the first 21 seconds).
  • In 0.2.9, bridge clients will try to download bridge descriptors 3 times in the first 3 hours (the first time after 20 minutes). In 0.2.8, this was 4 times in the first 3 hours (the first time after 1 hour).
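The exponent-3 sequences above follow a simple rule: each attempt happens, on average, a fixed multiple later than the previous one. A short sketch can regenerate them; `average_attempt_times` is a hypothetical helper for illustration, not Tor's code:

```python
def average_attempt_times(initial_delay, multiplier, attempts):
    """Average time of each download attempt, assuming each attempt
    happens `multiplier` times later than the previous one, and that
    an initial delay of 0 is followed by an attempt at t = 1.
    A hypothetical helper for illustration, not Tor's implementation."""
    times, t = [], initial_delay
    for _ in range(attempts):
        times.append(t)
        t = t * multiplier if t > 0 else 1
    return times

print(average_attempt_times(0, 3, 6))     # [0, 1, 3, 9, 27, 81]
print(average_attempt_times(6, 3, 3))     # [6, 18, 54]
print(average_attempt_times(1200, 3, 3))  # [1200, 3600, 10800]
```

With multiplier 3 and the three starting points above, this reproduces the exponent-3 sequences; the exponent-1.5 sequences match only approximately because of integer rounding in the real code.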

comment:9 Changed 21 months ago by teor

Keywords: CoreTorTeam201611 added
Status: needs_review → new
Version: Tor: 0.2.9.1-alpha

So, in summary, here are the 3 changes I think resolve #20499, by making the exponential backoff schedules much more like the 0.2.8 schedules:

  • adjust max_increment to be delay*5, which makes the exponent (1+5)/2 = 3
  • make the first entry in ClientBootstrapConsensusAuthorityDownloadSchedule 6
  • make the first entry in TestingBridgeDownloadSchedule 1200
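The arithmetic behind "max_increment = delay*5 makes the exponent (1+5)/2 = 3" can be checked numerically. The sketch below assumes the next delay is drawn uniformly from [delay, 5*delay]; that uniform-draw model is an assumption for illustration, not a copy of Tor's backoff code:

```python
import random

# Monte Carlo check: if the next delay is drawn uniformly from
# [delay, 5 * delay], the delay grows by (1 + 5) / 2 = 3x on average.
random.seed(1)
delay = 60.0
draws = [random.uniform(delay, 5 * delay) for _ in range(200_000)]
avg_multiplier = sum(draws) / len(draws) / delay
print(f"average growth factor: {avg_multiplier:.2f}")  # ~3.00
```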

comment:10 Changed 21 months ago by nickm

Okay, I've taken a look here in my branch bug20534_029. Is that what you had in mind? I've not been able to convince myself 5x is safe, so I went with 4x. Let's see if that works out.

comment:11 Changed 21 months ago by nickm

Status: new → needs_review

comment:12 in reply to:  10 Changed 21 months ago by teor

Status: needs_review → merge_ready

Replying to nickm:

Okay, I've taken a look here in my branch bug20534_029. Is that what you had in mind? I've not been able to convince myself 5x is safe, so I went with 4x. Let's see if that works out.

That's fine, I was concerned 5x would lead to too much variance.

One nitpick: "no more than quadruple" is wrong; it should be "no more than quintuple", because the increment is added to the existing delay.

Let me just do my sums for exponent 2.5 (max = 4*delay):

0, 1, 2.5, 6.3, 15.6, 39.1, 97.7, ...
6, 15, 37.5, 93.75, ...
1200, 3000, 7500, 18750, ...

This would mean:

  • In 0.2.9, almost every tor instance would try to download almost every document 6 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
  • In 0.2.9, clients would try authorities 3 times in the first minute (and twice in the first 21 seconds). In 0.2.8, this was 2 times in the first minute (and twice in the first 21 seconds).
  • In 0.2.9, bridge clients will try to download bridge descriptors 3 times in the first 3 hours (the first time after 20 minutes). In 0.2.8, this was 4 times in the first 3 hours (the first time after 1 hour).

I think we should document this somewhere, but that should be a separate task.
(I'm sure arma would like to read this analysis, eventually.)

comment:13 Changed 21 months ago by nickm

I'll fix and merge then, and leave the ticket open for a documentation spree.

comment:14 Changed 21 months ago by teor

Here's a relevant comment from the 0.2.8 #4483 implementation in config.c. We can contrast it with the 0.2.9 #15942 implementation:

 /* With the ClientBootstrapConsensus*Download* below:
   * Clients with only authorities will try:
   *  - 3 authorities over 10 seconds, then wait 60 minutes.
   * Clients with authorities and fallbacks will try:
   *  - 2 authorities and 4 fallbacks over 21 seconds, then wait 60 minutes.
   * Clients will also retry when an application request arrives.
   * After a number of failed requests, clients retry every 3 days + 1 hour.
   *
   * Clients used to try 2 authorities over 10 seconds, then wait for
   * 60 minutes or an application request.
   *
   * When clients have authorities and fallbacks available, they use these
   * schedules: (we stagger the times to avoid thundering herds) */

comment:15 Changed 21 months ago by teor

Also, authority-only clients will try 6 authorities in the first minute. (5 in the first 30 seconds, 4 in the first 10 seconds). This isn't ideal, but it's also not the default, so I don't think it matters that much.

One way to fix this would be to make ClientBootstrapConsensusAuthorityOnlyDownloadSchedule start with 1 or 2 or 3, rather than 0. (What I really want is a way to say: "initial delay, then exponential on this other delay after that".)
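The "initial delay, then exponential on this other delay after that" idea could be expressed as a generator that yields a once-off initial delay, then backs off from an independent base. The helper name and the uniform draw below are hypothetical, not Tor's code:

```python
import random

def backoff_delays(initial_delay, base, max_factor, rng=random):
    """Hypothetical sketch: yield a once-off initial delay, then
    exponential backoff starting from `base`, with each new delay
    drawn uniformly from [delay, delay * max_factor].  An
    illustration of the idea, not Tor's implementation."""
    yield initial_delay
    delay = base
    while True:
        yield delay
        delay = rng.randint(delay, delay * max_factor)

# e.g. wait 6 seconds once, then back off exponentially from 1 second
gen = backoff_delays(6, 1, 3)
```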

comment:16 Changed 21 months ago by teor

Here's how the latest version works, with the next delay in [1, delay*5] (multiplier 4), except when delay is 0, when the next delay is in [1,2]. This time, I'm using ranges and averages:

0, 1.5 [1-2], 4.3 [2-10], 11 [3-50], 28 [4-250], 71 [5-1250], ...
6, 19 [7-30], 47 [8-150], 117 [9-750], 294 [10-3750], ...
1200, 3601 [1201-6000], 9002 [1202-30000], 22505 [1203-150000], ...

Ok, so that multiplier is going to be terrible (in rare cases) for clients, let's try delay*4 (multiplier 3):

0, 1.5 [1-2], 5 [2-8], 11 [3-32], 22 [4-128], 44 [5-256], 88 [6-512], ...
6, 16 [7-24], 32 [8-96], 64 [9-384], 128 [10-1536], ...
1200, 3001 [1201-4800], 6002 [1202-19200], 12004 [1203-76800], ...

And delay*3 (multiplier 2):

0, 1.5 [1-2], 4 [2-6], 6.5 [3-18], 10 [4-54], 16 [5-162], 24 [6-486], 37 [7-1458], 56 [8-4374], 85 [9-13122], ...
6, 13 [7-18], 19 [8-54], 29 [9-162], 45 [10-486], ...
1200, 2400 [1201-3600], 3601 [1202-10800], 5402 [1203-32400], ...

So a multiplier of 2 is too fast; 3 seems just about right.

comment:17 Changed 21 months ago by teor

I wanted a multiplier of 2.5, but we settled on 3, because, integers.

In 0.3.0 or 0.3.1, I think we should consider having the next delay in [delay*2, delay*3], rather than [delay, delay*3]. This would increase the average and lower bound, and decrease the upper bound (I don't like the range in the current model). The initial values might need some tweaking after this change.

comment:18 Changed 21 months ago by nickm

Milestone: Tor: 0.2.9.x-final → Tor: 0.3.0.x-final
Owner: set to nickm
Status: merge_ready → accepted

Merged, leaving open for documentation

comment:19 Changed 21 months ago by teor

Turns out that sometimes I just can't do my sums right.
Here are the actual figures for each case with exponent 4 (multiplier 3):

Initial delay 0: (most schedules)

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	0	0	0
2	1	2	2	1	2
3	1	8	5	2	10
4	1	32	13	3	42
5	1	128	33	4	170
6	1	512	84	5	682
7	1	2048	211	6	2730

Initial delay 6: (client bootstrap from authorities)

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	6	6	6
2	1	20	11	7	26
3	1	80	27	8	106
4	1	320	69	9	426
5	1	1280	174	10	1706

Initial delay 1200: (bridge client bridge descriptors)

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	1200	1200	1200
2	1	3602	1802	1201	4802
3	1	14408	4505	1202	19210
4	1	57632	11263	1203	76842

And for test networks, with exponent 3 (multiplier 2):
Initial delay 0: (most schedules)

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	0	0	0
2	1	2	2	1	2
3	1	6	4	2	8
4	1	18	9	3	26
5	1	54	19	4	80
6	1	162	39	5	242
7	1	486	79	6	728

Initial delay 20 (bridge client bridge descriptors):

Attempt / Failure	Min Increment	Max Increment	Average	Minimum	Maximum
1	0	0	20	20	20
2	1	42	42	21	62
3	1	126	84	22	188
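The tables above can be approximated with a small Monte Carlo simulation of the scheme from comment 16: each new increment drawn uniformly from [1, previous_increment * "exponent"], and from [1, 2] after a zero increment. This is a sketch for checking the averages and ranges, not Tor's implementation, and it only models the zero-initial-delay case exactly:

```python
import random

def simulate(initial_delay, max_factor, attempts, trials=40_000, seed=0):
    """Monte Carlo sketch of the backoff scheme described above: each
    new increment is drawn uniformly from [1, previous * max_factor]
    (from [1, 2] after a zero increment).  Returns (average, min, max)
    of the cumulative attempt time for each attempt.  An illustration
    only, not Tor's code."""
    rng = random.Random(seed)
    stats = [[] for _ in range(attempts)]
    for _ in range(trials):
        t = inc = initial_delay
        for i in range(attempts):
            stats[i].append(t)
            inc = rng.randint(1, 2) if inc == 0 else rng.randint(1, inc * max_factor)
            t += inc
    return [(sum(s) / len(s), min(s), max(s)) for s in stats]

# "Exponent" 4, initial delay 0 (most schedules); compare with the
# first table above: averages near 0, 2, 5, 13, 33 and maxima
# 0, 2, 10, 42, 170.
for avg, lo, hi in simulate(0, 4, 5):
    print(f"avg {avg:8.1f}   range [{lo}-{hi}]")
```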

Changed 21 months ago by teor

Attachment: Exponential Backoff.xlsx added

Spreadsheet used to calculate averages and ranges (sorry, Excel)

comment:20 Changed 21 months ago by teor

Parent ID: #20499 deleted

comment:21 Changed 19 months ago by nickm

Keywords: triaged-out-20170124 added
Milestone: Tor: 0.3.0.x-final → Tor: 0.3.1.x-final

comment:22 Changed 17 months ago by nickm

Priority: Medium → Low

Lower priority on some of my assigned tickets

comment:23 Changed 16 months ago by nickm

Keywords: 031-reach added

comment:24 Changed 15 months ago by catalyst

Cc: catalyst added

comment:25 Changed 15 months ago by nickm

Milestone: Tor: 0.3.1.x-final → Tor: 0.3.2.x-final

I'm not going to be able to get this right in 0.3.1

comment:26 Changed 15 months ago by nickm

Keywords: CoreTorTeam201611 removed

comment:27 Changed 15 months ago by nickm

Keywords: 031-reach removed

comment:28 Changed 11 months ago by nickm

What's our status here? Should we merge something else in 0.3.2, or close this and document it, or some other thing?

comment:29 in reply to:  18 Changed 11 months ago by teor

Replying to nickm:

Merged, leaving open for documentation

We should document and close this.
Given that things are working, the other child tickets aren't a high priority.

comment:30 Changed 11 months ago by nickm

Milestone: Tor: 0.3.2.x-final → Tor: 0.3.3.x-final
Owner: nickm deleted
Status: accepted → assigned

Okay; then I'll unassign from myself (I can't write this documentation) and defer to 0.3.3.

comment:31 Changed 11 months ago by nickm

Status: assigned → new

comment:32 Changed 6 months ago by nickm

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final
Type: defect → enhancement

Label a bunch of (arguable and definite) enhancements as enhancements for 0.3.4.

comment:33 Changed 5 months ago by nickm

Keywords: 034-triage-20180328 added

comment:34 Changed 5 months ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:35 Changed 4 months ago by nickm

Milestone: Tor: 0.3.4.x-final → Tor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:36 Changed 7 weeks ago by teor

Resolution: wontfix
Status: new → closed

These schedules use exponential backoff with decorrelated jitter, with no maximums. That seems to be working well enough.
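For reference, "decorrelated jitter" is usually given (e.g. in the AWS Architecture Blog write-up of backoff strategies) as drawing each sleep uniformly from [base, previous_sleep * 3]. The sketch below follows that standard form, with an optional cap since the schedules here have no maximums; the helper name is hypothetical and this is not Tor's code:

```python
import random

def decorrelated_jitter(base, cap=None, rng=random):
    """Sketch of decorrelated-jitter backoff: each sleep is drawn
    uniformly from [base, previous_sleep * 3], optionally capped.
    The standard published form, not Tor's exact variant."""
    sleep = base
    while True:
        yield sleep
        sleep = rng.uniform(base, sleep * 3)
        if cap is not None:
            sleep = min(sleep, cap)
```

Because each draw depends on the previous sleep rather than the attempt count, long sleeps decay back toward `base` instead of compounding forever, which spreads retries out without synchronizing clients.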
