We should tweak the download schedules in config.c based on what we've learned in #20499 (moved).
These schedules should retry sooner than never:
TestingServerDownloadSchedule
TestingClientDownloadSchedule
These schedules retry at most every 2 hours, should that be higher?
TestingServerConsensusDownloadSchedule
These schedules retry at most every 12 hours, should that be higher? lower?
TestingClientConsensusDownloadSchedule
These schedules retry at most every 73 hours, should that be lower?
Should we try more times before jumping to retrying after an hour?
ClientBootstrapConsensusAuthorityDownloadSchedule
ClientBootstrapConsensusFallbackDownloadSchedule
ClientBootstrapConsensusAuthorityOnlyDownloadSchedule
Should we try more than 7 or 8 times to get directory documents?
TestingConsensusMaxDownloadTries
ClientBootstrapConsensusMaxDownloadTries
TestingDescriptorMaxDownloadTries
TestingMicrodescMaxDownloadTries
TestingCertMaxDownloadTries
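For context, each of the schedule options above is configured as a comma-separated list of retry delays in seconds, and the Testing* variants only take effect when TestingTorNetwork is set. A hedged illustration with made-up values, not the shipped defaults:

```
## torrc sketch -- illustrative values only, not Tor's actual defaults
TestingTorNetwork 1
TestingClientDownloadSchedule 0, 60, 300, 600, 3600, 7200
TestingConsensusMaxDownloadTries 8
```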
Actually, there's not much point in revising these schedules - the exponential backoff code only pays attention to the first and last value in the schedule.
teor, should we keep this in 029 then? Or the ticket at all?
If we revert the exponential backoff code as our solution to #20499 (moved), we will want to tweak the final times on some of these schedules, to achieve Roger's goal of "never have a relay stop trying entirely".
If we don't, we will want to adjust the initial time as well.
So I think my comment was a bit hasty - it's only the middle times that don't matter, and only if we keep exponential backoff.
I think that the answer here is just to remove the final times entirely, or make them not count when the schedule is exponential. I've done this as part of my branch 20499_part1_029.
In 0.2.9, almost every tor instance will try to download almost every document 11 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
In 0.2.9, clients will try authorities 5 times in the first minute. In 0.2.8, this was 2 times in the first minute.
In 0.2.9, bridge clients will try to download bridge descriptors 2 times in the first 3 hours. In 0.2.8, this was 4 times in the first 3 hours.
We can't fix this by modifying the minimum times. But we might be able to fix it by modifying the exponent. Or providing a failure count at which we increase the delay to hourly, rather than slowly increasing it in an exponential fashion.
Here's what it would look like if the exponent were 3 (max = 5*delay) rather than 1.5 (max = 2*delay), and we adjusted the client to authority and bridge descriptor start times:
In 0.2.9, almost every tor instance would try to download almost every document 5 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
In 0.2.9, clients would try authorities 3 times in the first minute (and twice in the first 24 seconds). In 0.2.8, this was 2 times in the first minute (and twice in the first 21 seconds).
In 0.2.9, bridge clients would try to download bridge descriptors 3 times in the first 3 hours (the first time after 20 minutes). In 0.2.8, this was 4 times in the first 3 hours (the first time after 1 hour).
Okay, I've taken a look here in my branch bug20534_029. Is that what you had in mind? I've not been able to convince myself 5x is safe, so I went with 4x. Let's see if that works out.
That's fine, I was concerned 5x would lead to too much variance.
One nitpick: "no more than quadruple" is wrong; it is "no more than quintuple", because the increment is added to the existing delay.
Let me just do my sums for exponent 2.5 (max = 4*delay):
In 0.2.9, almost every tor instance would try to download almost every document 6 times in the first minute. In 0.2.8, this was 3-4 times in the first minute.
In 0.2.9, clients would try authorities 3 times in the first minute (and twice in the first 21 seconds). In 0.2.8, this was 2 times in the first minute (and twice in the first 21 seconds).
In 0.2.9, bridge clients would try to download bridge descriptors 3 times in the first 3 hours (the first time after 20 minutes). In 0.2.8, this was 4 times in the first 3 hours (the first time after 1 hour).
I think we should document this somewhere, but that should be a separate task.
(I'm sure arma would like to read this analysis, eventually.)
Here's a relevant comment from the 0.2.8 #4483 (moved) implementation in config.c. We can contrast it with the 0.2.9 #15942 (moved) implementation:
/* With the ClientBootstrapConsensus*Download* below:
 *  Clients with only authorities will try:
 *   - 3 authorities over 10 seconds, then wait 60 minutes.
 *  Clients with authorities and fallbacks will try:
 *   - 2 authorities and 4 fallbacks over 21 seconds, then wait 60 minutes.
 *  Clients will also retry when an application request arrives.
 *  After a number of failed requests, clients retry every 3 days + 1 hour.
 *
 *  Clients used to try 2 authorities over 10 seconds, then wait for
 *  60 minutes or an application request.
 *
 *  When clients have authorities and fallbacks available, they use these
 *  schedules: (we stagger the times to avoid thundering herds) */
Also, authority-only clients will try 6 authorities in the first minute. (5 in the first 30 seconds, 4 in the first 10 seconds). This isn't ideal, but it's also not the default, so I don't think it matters that much.
One way to fix this would be to make ClientBootstrapConsensusAuthorityOnlyDownloadSchedule start with 1 or 2 or 3, rather than 0. (What I really want is a way to say: "initial delay, then exponential on this other delay after that".)
Here's how the latest version works, with the next delay in [1, delay*5] (multiplier 4), except when delay is 0, when the next delay is in [1,2]. This time, I'm using ranges and averages:
I wanted a multiplier of 2.5, but we settled on 3, because, integers.
In 0.3.0 or 0.3.1, I think we should consider having the next delay in [delay*2, delay*3], rather than [delay, delay*3]. This would increase the average and lower bound, and decrease the upper bound (I don't like the range in the current model). The initial values might need some tweaking after this change.