Opened 10 months ago

Closed 9 months ago

#20499 closed defect (fixed)

A running Tor won't update the microdesc consensus

Reported by: rubiate
Owned by: nickm
Priority: High
Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.9.4-alpha
Severity: Normal
Keywords: regression, CoreTorTeam201611
Cc:
Actual Points:
Parent ID: #20534
Points:
Reviewer:
Sponsor:

Description

I am observing that my relay and bridge will update the microdesc consensus when they are restarted or catch SIGHUP, but not while they are running. In the case of the bridge, the consensus it serves eventually falls out of date, and clients that try to connect through it will hang on "I learned some more directory information, but not enough to build a circuit: We have no recent usable consensus" and never connect to the network.

The bridge and relay I have seen this happening on are running 0.2.9.4-alpha on OpenBSD. I also spun up a new bridge on Debian (also running 0.2.9.4-alpha), and it appears to have the same problem. This does not appear to happen with 0.2.8.9.

Here is what appears to be happening:

At startup (or reload) the relay fetches the microdesc consensus

1 minute later it tries to fetch it again (update_consensus_networkstatus_downloads() is called) and receives a 304 response as it hasn't been modified

download_status_increment_failure() gets called with a status_code of 304

update_consensus_networkstatus_downloads() gets called again, this time it stops at the call to connection_dir_count_by_purpose_and_resource() which returns 1 (equal to max_in_progress_conns)

download_status_increment_failure() gets called again, this time with a status_code of 0 (as a result each 304 response results in the fail count being increased by 2)

The previous steps repeat every minute for a few minutes until the failure count reaches 10 (exceeding the max fail count of 8)

At this point the download check still runs every minute, but download_status_is_ready() doesn't return true because the failure count exceeds the max, so the fetch is skipped without being attempted

Eventually the consensus falls out of date, but download_status_is_ready() still won't return true, so it won't try to fetch a new one

On 0.2.8.9 it makes a couple of attempts that fail with a 304 response, but download_status_is_ready() will eventually start returning false because next_attempt_at is greater than the current time. It seems that on 0.2.8.9 next_attempt_at is increased much more aggressively, first by 1 minute, then 10 minutes, and then an hour, so it accumulates a failure count of 6 but then waits long enough that the next attempt succeeds.

On 0.2.9.4-alpha, it looks like the value of next_attempt_at is increased more slowly, by only seconds at a time, so it reattempts every minute and quickly reaches the failure limit.
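
To make the double-counting concrete, here is a minimal stand-alone model of the sequence above (names and structure are illustrative, not the actual Tor functions):

/* Toy model of the failure accounting described above; illustrative only. */
typedef struct {
  int n_download_failures;
} dl_status_model_t;

#define MAX_IN_PROGRESS_CONNS 1

static void
increment_failure_model(dl_status_model_t *dls)
{
  dls->n_download_failures++;
}

/* One "minute tick" as reported: the pending request comes back 304, and the
 * follow-up attempt is skipped because a connection already counts as in
 * progress, so each 304 costs two failures and the limit of 8 is crossed
 * after five ticks. */
static void
one_minute_tick(dl_status_model_t *dls, int got_304, int conns_in_progress)
{
  if (got_304)
    increment_failure_model(dls);   /* corresponds to status_code == 304 */
  if (conns_in_progress >= MAX_IN_PROGRESS_CONNS)
    increment_failure_model(dls);   /* corresponds to status_code == 0 */
}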

Child Tickets

#20533 (closed): Each download request should only increment the failure count once (Core Tor/Tor)
#20536 (closed): Should tor keep on retrying, even if it has reached the failure limit? (Core Tor/Tor)
#20591 (closed): Ensure relays don't make multiple connections during bootstrap (Core Tor/Tor)
#20593 (closed): Avoid resetting download status on 503 (Core Tor/Tor)

Change History (37)

comment:1 Changed 10 months ago by dgoulet

Keywords: regression added
Milestone: Tor: 0.2.9.x-final
Priority: Medium → High

comment:2 Changed 10 months ago by arma

This looks very similar to my "mystery 1" in #19969 -- so I am going to bring that discussion over here so we can keep the bugs separate.

For me it happened on "Tor 0.2.9.3-alpha-dev (git-bfaded9143d127cb)" (which is alas not in a released version of Tor, because #20269 isn't merged yet, but suffice to say it's partway between 0.2.9.3 and 0.2.9.4). And this was just a client, not a bridge or relay.

teor asked in #19969 what my "consensus download_status_t has for all its fields, particularly the attempt and failure counts." Here is what gdb says:

(gdb) print consensus_dl_status
$8 = {{next_attempt_at = 0, n_download_failures = 0 '\000',
    n_download_attempts = 0 '\000', schedule = DL_SCHED_CONSENSUS,
    want_authority = DL_WANT_ANY_DIRSERVER,
    increment_on = DL_SCHED_INCREMENT_FAILURE,
    backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 0 '\000',
    last_delay_used = 0}, {next_attempt_at = 1477296555,
    n_download_failures = 17 '\021', n_download_attempts = 17 '\021',
    schedule = DL_SCHED_CONSENSUS, want_authority = DL_WANT_ANY_DIRSERVER,
    increment_on = DL_SCHED_INCREMENT_FAILURE,
    backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 17 '\021',
    last_delay_used = 387}}

The first half of that makes sense, since my Tor doesn't touch the non-microdesc consensus stuff. But for the second half... it looks like my next_attempt_at is ~6 days ago.

comment:3 Changed 10 months ago by arma

teor also asked if my Tor has marked each of the directory authorities down. I believe the answer is yes -- all the entries in trusted_dir_servers have is_running set to 0, except the Bifroest entry (which makes sense). Here is an example, to be thorough:

(gdb) print *(dir_server_t *)(trusted_dir_servers->list[0])
$16 = {
  description = 0x7f57f56c0770 "directory server \"moria1\" at 128.31.0.39:9131"
, nickname = 0x7f57f56c0750 "moria1",
  address = 0x7f57f56c0600 "128.31.0.39", ipv6_addr = {family = 0, addr = {
      dummy_ = 0, in_addr = {s_addr = 0}, in6_addr = {__in6_u = {
          __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0,
            0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, addr = 2149515303,
  dir_port = 9131, or_port = 9101, ipv6_orport = 0, weight = 1,
  digest = "\226\225\337\303_\376\270a2\233\237\032\260LF9p \316\061",
  v3_identity_digest = "Õ<86>Ñ<83>\t\336\324\315mW\301\217Û<97>\357\251m3\005f",
  is_running = 0, is_authority = 1, has_accepted_serverdesc = 0,
  type = (V3_DIRINFO | EXTRAINFO_DIRINFO | MICRODESC_DIRINFO), 
  addr_current_at = 1448282868, fake_status = {published_on = 0, 
    nickname = "moria1", '\000' <repeats 13 times>, 
    identity_digest = "\226\225\337\303_\376\270a2\233\237\032\260LF9p \316\061", descriptor_digest = '\000' <repeats 31 times>, addr = 2149515303, 
    or_port = 9101, dir_port = 9131, ipv6_addr = {family = 0, addr = {
        dummy_ = 0, in_addr = {s_addr = 0}, in6_addr = {__in6_u = {
            __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 
              0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, 
    ipv6_orport = 0, is_authority = 0, is_exit = 0, is_stable = 0, 
    is_fast = 0, is_flagged_running = 0, is_named = 0, is_unnamed = 0, 
    is_valid = 0, is_possible_guard = 0, is_bad_exit = 0, is_hs_dir = 0,
    is_v2_dir = 0, protocols_known = 0, supports_extend2_cells = 0,
    has_bandwidth = 0, has_exitsummary = 0, bw_is_unmeasured = 0,
    bandwidth_kb = 0, has_guardfraction = 0, guardfraction_percentage = 0,
    exitsummary = 0x0, last_dir_503_at = 0, dl_status = {next_attempt_at = 0,
      n_download_failures = 0 '\000', n_download_attempts = 0 '\000',
      schedule = DL_SCHED_GENERIC, want_authority = DL_WANT_ANY_DIRSERVER,
      increment_on = DL_SCHED_INCREMENT_FAILURE,
      backoff = DL_SCHED_DETERMINISTIC, last_backoff_position = 0 '\000',
      last_delay_used = 0}}}

comment:4 Changed 10 months ago by arma

teor also asked if my tor has marked all the fallback dirs down.

The fallback_dir_servers smartlist has 90 elements so I didn't check all of them, but for the first few, is_running was set to 1. That is, no, my Tor seemed to think that the fallback dirs were just fine to contact. It simply chose not to contact them.

comment:5 Changed 10 months ago by nickm

I wonder if the new random_exponential_backoff code in 0.2.9 could be at issue.

comment:6 Changed 10 months ago by arma

Summary: Relay/Bridge won't update the microdesc consensus while running → A running Tor won't update the microdesc consensus

(changing the title since this happened to my tor client too)

comment:7 Changed 10 months ago by arma

To give some other context to folks reading this ticket, here is some more debugging detail (all from my Tor client that has been opting not to retrieve a new consensus for the past week+):

I set a breakpoint on fetch_networkstatus_callback, and learned that prefer_mirrors is 1, and we_are_bootstrapping is 1. should_delay_dir_fetches() was 0. It called update_networkstatus_downloads as expected.

Then I set a breakpoint on update_consensus_networkstatus_downloads. Again we_are_bootstrapping is 1. use_multi_conn alas is <optimized out>, but looking at the function, I assume it's 1 for me. The first time through the loop, for the vanilla consensus flavor, we_want_to_fetch_flavor() is no, so I move on to the second pass. I don't know how c = networkstatus_get_latest_consensus_by_flavor(i); goes, because "print c" also says <optimized out>, but it looks like it runs time_to_download_next_consensus[i] = now; next, so I can assume that c was NULL. Then it keeps going through the function until it calls update_consensus_bootstrap_multiple_downloads(). Looks plausible.

So I set a breakpoint on update_consensus_bootstrap_multiple_downloads, which was trickier than I would have wanted since it looks like my compiler inlined it into update_consensus_networkstatus_downloads. But it looks like it does make two calls to update_consensus_bootstrap_attempt_downloads -- one with dls_f, and the next with dls_a.

So I set a breakpoint on update_consensus_bootstrap_attempt_downloads. It sets max_dl_tries to 7, which makes sense since I see it there in config.c, set to default to 7.

Now it gets interesting:

(gdb) print *dls
$3 = {next_attempt_at = 1477295479, n_download_failures = 0 '\000', 
  n_download_attempts = 8 '\b', schedule = DL_SCHED_CONSENSUS, 
  want_authority = DL_WANT_ANY_DIRSERVER, 
  increment_on = DL_SCHED_INCREMENT_ATTEMPT, 
  backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 8 '\b', 
  last_delay_used = 7}

n_download_attempts is 8, and max_dl_tries is 7. I wonder where this is going!

Turns out download_status_is_ready() is no fun to gdb in because it's an inline tucked into directory.h, but looking at the code, it seems clear it returns 0, i.e. not ready. Then the function exits.

Then the same thing happens with the update_consensus_bootstrap_attempt_downloads call that was for dls_a.

Conclusion, I somehow failed 8 times to get a consensus, and now I will never try again.
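
In other words, the check seems to reduce to something like this model (a sketch only, not the actual inline from directory.h):

#include <time.h>

/* With n_download_attempts already at 8 and max_dl_tries at 7, this can
 * never return 1 again, no matter how stale the consensus gets. */
static int
download_is_ready_model(int n_attempts, int n_failures, int max_tries,
                        time_t next_attempt_at, time_t now)
{
  if (n_attempts >= max_tries)
    return 0;
  if (n_failures >= max_tries)
    return 0;
  /* Otherwise, wait until the schedule says it is time. */
  return next_attempt_at <= now;
}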

comment:8 Changed 10 months ago by arma

Looking back at rubiate's analysis, my debugging matches theirs.

comment:9 Changed 10 months ago by arma

For thoroughness,

(gdb) print consensus_bootstrap_dl_status
$6 = {{next_attempt_at = 1477296321, n_download_failures = 0 '\000', 
    n_download_attempts = 8 '\b', schedule = DL_SCHED_CONSENSUS, 
    want_authority = DL_WANT_AUTHORITY, 
    increment_on = DL_SCHED_INCREMENT_ATTEMPT, 
    backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 8 '\b', 
    last_delay_used = 273}, {next_attempt_at = 1477295479, 
    n_download_failures = 0 '\000', n_download_attempts = 8 '\b', 
    schedule = DL_SCHED_CONSENSUS, want_authority = DL_WANT_ANY_DIRSERVER, 
    increment_on = DL_SCHED_INCREMENT_ATTEMPT, 
    backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 8 '\b', 
    last_delay_used = 7}}

comment:10 Changed 10 months ago by arma

#8625 has this choice quote: "I say we merge and wait to see if we get bugs reported?"

comment:11 Changed 10 months ago by arma

Easy answer is to revert 09a0f2d0b24 until we've designed a better download schedule and a better mechanism for testing it?

comment:12 Changed 10 months ago by arma

rubiate, are you able to reproduce this bug consistently? If so, can you spin up a relay or bridge with commit 09a0f2d0b24 reverted, and see how that fares? My guess is that it will fare much better.

In the meantime, I've opened #20501 to look at the Tor network for relays that were bitten by this bug (seems like quite a few), and #20509 for doing something about getting them off the network and/or taking away their Guard flag so clients don't get stuck behind them and end up unable to use Tor.

comment:13 Changed 10 months ago by nickm

I concur that for 0.2.9, reverting 09a0f2d0b seems like a good choice. If we feel cheeky, we might increase the interval, though...

comment:14 Changed 10 months ago by rubiate

I set up two new relays:

bug20499bad ED6E43F07ABE87C017BD80D4BA24E41F8FF32E94
bug20499revert 083971FD18EDBD442DF0971D0FDC6F500642AD91

'bad is running 0.2.9.4-alpha, and 'revert is identical except it has 09a0f2d0b24 reverted. Both have the same config, other than cosmetic differences (different names, ports, logfile).

Rough timeline:

10:36 Both start up
[...] both request the consensus every minute
10:41 they reach a fail count of 10
[...] both do nothing every minute, fail count is too high
11:36 'revert updates the microdesc consensus; valid-until 2016-11-01 14:00:00
11:36 'bad is still doing nothing, valid-until 2016-11-01 13:00:00
11:37 'revert tries to do it again, gets a 304, fail count is 2
11:38 'revert tries to do it again, gets a 304, fail count is 4
[...] and so on
11:41 'revert reaches a fail count of 10 again
[...] both do nothing
12:36 'revert updates; valid-until 2016-11-01 15:00:00
12:36 'bad doesn't; valid-until 2016-11-01 13:00:00
12:41 'revert hits 10 fails again
[...] doin' nothin'
13:00 'bad now has a microdesc consensus past its valid-until date
[...]
13:36 'revert updates; valid-until 2016-11-01 16:00:00
13:36 'bad doesn't; valid-until 2016-11-01 13:00:00
[...] I suspect this pattern is going to hold.

'revert:
http://45.32.188.229:9209/tor/status-vote/current/consensus-microdesc

'bad:
http://45.32.188.229:9201/tor/status-vote/current/consensus-microdesc

comment:15 Changed 10 months ago by teor

Perhaps max_dl_tries is also far too low, and we should increase it somewhat? 10?

comment:16 Changed 10 months ago by nickm

So, if we're truly on exponential backoff, no maximum could be too large, right?

I also wonder, why are these failure counts so high?

comment:17 in reply to:  16 ; Changed 10 months ago by teor

Replying to nickm:

So, if we're truly on exponential backoff, no maximum could be too large, right?

Technically, yes.

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?
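
For a sense of scale, a toy doubling model (not Tor's actual randomized backoff) shows how quickly "retry later" becomes indistinguishable from "never":

#include <stdio.h>

int
main(void)
{
  /* Assume the delay roughly doubles per failure from a 60-second base; the
   * real code uses a randomized exponential step, so exact values differ. */
  double delay = 60.0;
  for (int failures = 1; failures <= 20; failures++) {
    printf("after %2d failures: next delay ~ %7.1f days\n",
           failures, delay / 86400.0);
    delay *= 2.0;
  }
  return 0;
}

By this model the delay grows from a minute to roughly a year within about twenty failures.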

I also wonder, why are these failure counts so high?

Firstly, because they get incremented twice for each failure.

download_status_increment_failure() gets called with a status_code of 304

update_consensus_networkstatus_downloads() gets called again, this time it stops at the call to connection_dir_count_by_purpose_and_resource() which returns 1 (equal to max_in_progress_conns)

download_status_increment_failure() gets called again, this time with a status_code of 0 (as a result each 304 response results in the fail count being increased by 2)

And secondly, because the laptop was offline for 12? hours?

comment:18 Changed 10 months ago by teor

Oh, and because we have bad directory mirrors serving out of date consensuses repeatedly?

comment:19 in reply to:  14 ; Changed 10 months ago by arma

Replying to rubiate:

I set up two new relays:
[...]
Rough timeline:

10:36 Both start up
[...] both request the consensus every minute
10:41 they reach a fail count of 10

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Are you firewalled in some weird way? Are they trying to fetch it from fallbackdirs and those are surprisingly faily? Are they trying from directory authorities and our authorities are no good?

Anyway, it looks like 'revert' is the winner, but it would still be great to learn what is so helpful about your test environment that it triggers this bug so well.


comment:20 in reply to:  19 Changed 10 months ago by teor

Replying to arma:

Replying to rubiate:

I set up two new relays:
[...]
Rough timeline:

10:36 Both start up
[...] both request the consensus every minute
10:41 they reach a fail count of 10

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Are you firewalled in some weird way? Are they trying to fetch it from fallbackdirs and those are surprisingly faily? Are they trying from directory authorities and our authorities are no good?

Anyway, it looks like 'revert' is the winner, but it would still be great to learn what is so helpful about your test environment that it triggers this bug so well.

Well, they're in Australia, so latency is high, and measured bandwidth is low. But I'm not sure that would cause so many failures. Maybe something with OpenBSD?

My relay at the same provider has AccountingMax set, and has disabled its DirPort, so it's much harder to interrogate. It's on FreeBSD, but on 0.2.8.7 (still waiting for a package update), and up to date with its consensuses.

comment:21 in reply to:  17 ; Changed 10 months ago by arma

Replying to teor:

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

Firstly, because they get incremented twice for each failure.

I haven't looked into that one, but if so, can we open a new ticket for this (what looks like separate) bug?

And secondly, because the laptop was offline for 12? hours?

Actually, I think I drove my consensus download failure count up to 8 over the course of about ten minutes -- it launches each new try within a second of when the last one failed:

Oct 24 03:51:12.713 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap mirror networkstatus consensus download.
Oct 24 03:51:12.713 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:51:42.725 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:52:36.747 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:54:14.787 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:57:19.855 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 04:00:48.938 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.

My laptop was closed (asleep) for more than a day, so when it woke up its consensus was more than 24 hours old, so it immediately jumped to bootstrap mode for its downloads. Ten minutes later, it had given up permanently.

comment:22 in reply to:  21 Changed 10 months ago by teor

Replying to arma:

Replying to teor:

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

Proposal 210 is close, but it's been modified by at least you, me, and andrea since then.

Firstly, because they get incremented twice for each failure.

I haven't looked into that one, but if so, can we open a new ticket for this (what looks like separate) bug?

#20533

And secondly, because the laptop was offline for 12? hours?

Actually, I think I drove my consensus download failure count up to 8 over the course of about ten minutes -- it launches each new try within a second of when the last one failed:

Oct 24 03:51:12.713 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap mirror networkstatus consensus download.
Oct 24 03:51:12.713 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:51:42.725 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:52:36.747 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:54:14.787 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:57:19.855 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 04:00:48.938 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.

My laptop was closed (asleep) for more than a day, so when it woke up its consensus was more than 24 hours old, so it immediately jumped to bootstrap mode for its downloads. Ten minutes later, it had given up permanently.

That timing is off from what I would expect - when I designed it, it was:
Fallbacks: 0, 1, 5, 16, 3600, ...
Authorities: 10, 21, 3600, ...

But if we're skipping two on every failure, it could become:
Fallbacks: 0, (1 or 5 depending on exact failure timing), 3600, ...
Authorities: 10, 3600, ...
And if the client has all the authorities as down, I guess it won't even try them.
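
A hedged sketch of that skipping effect, indexing the fallback schedule quoted above by a counter that gets bumped twice per real failure (values not checked against the source):

#include <stdio.h>

static const int fallback_schedule[] = { 0, 1, 5, 16, 3600 };
#define N_ENTRIES ((int)(sizeof(fallback_schedule) / sizeof(fallback_schedule[0])))

/* Clamp to the last entry, as a schedule lookup typically would. */
static int
delay_for(int count)
{
  return fallback_schedule[count < N_ENTRIES ? count : N_ENTRIES - 1];
}

int
main(void)
{
  for (int failures = 0; failures < N_ENTRIES; failures++)
    printf("failure %d: intended delay %4ds, double-counted delay %4ds\n",
           failures, delay_for(failures), delay_for(2 * failures));
  return 0;
}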

comment:23 in reply to:  21 Changed 10 months ago by teor

Replying to arma:

Replying to teor:

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

I logged #20534 for this. There are existing cases where we give up forever. We should tune them to do what we think we want.

comment:24 in reply to:  19 ; Changed 10 months ago by rubiate

Replying to arma:

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Ah, no, they both got a perfectly good consensus at startup. The "failures" every minute after that are from them having a fresh consensus, requesting a new one anyway and getting 304 Not Modified in response.

comment:25 in reply to:  24 Changed 10 months ago by teor

Replying to rubiate:

Replying to arma:

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Ah, no, they both got a perfectly good consensus at startup. The "failures" every minute after that are from them having a fresh consensus, requesting a new one anyway and getting 304 Not Modified in response.

So your reverted relay is bad: it retries 5 times every hour, when it should only try once an hour.
And your bad relay is also bad: it retries never, when it should at least try once an hour.

I opened #20535 for this.

comment:26 Changed 10 months ago by teor

Ok, and finally, for the behaviour at the failure limit, I opened #20536.

Some of these will also interact with #19969; we should consider both tickets when doing the fixes.

comment:27 Changed 9 months ago by nickm

What's the current consensus on what the minimal set of fixes for 029 is here? I'd like to do something in the next couple of days for an 0.2.9.5-alpha, even if we expect that we'll have to fine-tune it a little more for 0.2.9.6-rc.

Do we need to do all of the subtickets in 0.2.9, do you think? Or do we take a simpler approach?

comment:28 Changed 9 months ago by nickm

I have a branch bug20499_part1_029 that I think should be sufficient to make 0.2.9 work correctly here. It solves #20534 and #20536 (and #20587 too, since it was there). The other stuff, I think, could wait until 0.3.0? My understanding is limited though.

comment:29 Changed 9 months ago by nickm

Status: new → needs_review

comment:30 in reply to:  21 ; Changed 9 months ago by teor

Code review of 20499_part1_029:

I'm surprised your compiler didn't warn about the assignment here:

if (dls->backoff = DL_SCHED_DETERMINISTIC)
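
Presumably the intended condition is an equality test; a minimal stand-alone stand-in (the type and enum names here are placeholders, not Tor's definitions) showing the shape of the fix:

/* '==' where the reviewed line had '='. With -Wall (-Wparentheses), GCC and
 * clang warn when a bare assignment is used as a condition, which is why the
 * missing warning is surprising. */
typedef enum { BACKOFF_DETERMINISTIC, BACKOFF_RANDOM_EXPONENTIAL } backoff_model_t;
typedef struct { backoff_model_t backoff; } backoff_state_model_t;

static int
backoff_is_deterministic(const backoff_state_model_t *st)
{
  return st->backoff == BACKOFF_DETERMINISTIC;
}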

Replying to arma:

...
It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here.

We should think carefully about what we want the default maximum to be, and if we want it to be the same in every case:

*max = INT_MAX;

Perhaps the network could deal with slow zombies if they only retried every day/week/month. Or perhaps we really do want them to stop asking. Or perhaps we want clients to do one thing, and relays another.

Currently, some schedules do have INT_MAX as their maximum, others have 2, 12, or 73 hours.

All the other commits look fine, but I'm not sure they're enough to solve this issue.

comment:31 Changed 9 months ago by teor

Keywords: CoreTorTeam201611 added
Status: needs_review → needs_revision

As well as the commits in 20499_part1_029, I think we should also merge #20533 to fix the one-download-multiple-failure case.

We could also merge #20591; I don't think it's causing this particular issue, but it could make it worse if the bug is ever triggered.

If we really do want every failure to result in a schedule increment, we have to remove the following code:

download_status_increment_failure:

  /* only count the failure if it's permanent, or we're a server */
  if (status_code != 503 || server) {
    if (dls->n_download_failures < IMPOSSIBLE_TO_DOWNLOAD-1)
      ++dls->n_download_failures;
  }

Because combined with this code in download_status_schedule_get_delay, it causes a reset on every 503:

    /* Check if we missed a reset somehow */
    if (dls->last_backoff_position > dls_schedule_position) {
      dls->last_backoff_position = 0;
      dls->last_delay_used = 0;
    }

Which is exactly what we don't want when relays are busy - imagine clients doing an automatic reset every time they DoS a relay...

comment:32 in reply to:  30 Changed 9 months ago by nickm

Replying to teor:

Code review of 20499_part1_029:

I'm surprised your compiler didn't warn about the assignment here:

if (dls->backoff = DL_SCHED_DETERMINISTIC)

Yikes. Added a fixup commit for that.

Replying to arma:

...
It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here.

We should think carefully about what we want the default maximum to be, and if we want it to be the same in every case:

*max = INT_MAX;

Perhaps the network could deal with slow zombies if they only retried every day/week/month. Or perhaps we really do want them to stop asking. Or perhaps we want clients to do one thing, and relays another.

Currently, some schedules do have INT_MAX as their maximum, others have 2, 12, or 73 hours.

My thinking here was that we're backing off to infinite retries, so we might as well let the delays get arbitrarily large. We might want to cut the maximum back down if it turns out to be a problem in practice, but I'd like to be cautious about zombies for now.

All the other commits look fine, but I'm not sure they're enough to solve this issue.

Me neither!

comment:33 Changed 9 months ago by nickm

Squashing and merging my patch above along with #20591 and #20533.

comment:34 in reply to:  31 Changed 9 months ago by nickm

Owner: set to nickm
Status: needs_revision → accepted

Added #20593 for the reset-on-503 issue above.

comment:35 Changed 9 months ago by teor

I have sketched out some minor tweaks to the exponential backoff parameters in #20534. This makes the download timings much more like 0.2.8.

comment:36 Changed 9 months ago by nickm

Okay. We've merged a fairly large passel of fixes to try to address this. It is reasonably likely that we will need more fine-tuning, so this will be one of the things that makes 0.2.9.5-alpha an alpha release.

comment:37 Changed 9 months ago by teor

Parent ID: #20534
Resolution: fixed
Status: accepted → closed

Fixed, reparenting to #20534 for documentation
