Opened 10 months ago

Closed 9 months ago

#20499 closed defect (fixed)

A running Tor won't update the microdesc consensus

Reported by: rubiate
Owned by: nickm
Priority: High
Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.9.4-alpha
Severity: Normal
Keywords: regression, CoreTorTeam201611
Cc:
Actual Points:
Parent ID: #20534
Points:
Reviewer:
Sponsor:

Description

I am observing that my relay and bridge will update the microdesc consensus when they are restarted or catch SIGHUP, but not while they are running. In the case of the bridge, the consensus it serves eventually falls out of date, and clients that try to connect through it will hang on "I learned some more directory information, but not enough to build a circuit: We have no recent usable consensus" and never connect to the network.

The bridge and relay I have seen this happening on are running 0.2.9.4-alpha on OpenBSD. I also spun up a new bridge on Debian (also running 0.2.9.4-alpha), and it appears to have the same problem. This does not appear to happen with 0.2.8.9.

Here is what appears to be happening:

At startup (or reload) the relay fetches the microdesc consensus

1 minute later it tries to fetch it again (update_consensus_networkstatus_downloads() is called) and receives a 304 response as it hasn't been modified

download_status_increment_failure() gets called with a status_code of 304

update_consensus_networkstatus_downloads() gets called again, this time it stops at the call to connection_dir_count_by_purpose_and_resource() which returns 1 (equal to max_in_progress_conns)

download_status_increment_failure() gets called again, this time with a status_code of 0 (as a result each 304 response results in the fail count being increased by 2)

The previous steps repeat every minute for a few minutes until the failure count reaches 10 (exceeding the max fail count of 8)

At this point the download check still runs every minute, but download_status_is_ready() doesn't return true because the failure count exceeds the max, so the fetch is skipped without being attempted

Eventually the consensus falls out of date, but download_status_is_ready() still won't return true, so it won't try to fetch a new one

On 0.2.8.9 it makes a couple of attempts that fail with a 304 response, but download_status_is_ready() will eventually start returning false because next_attempt_at is greater than the current time. It seems that on 0.2.8.9 next_attempt_at is increased much more aggressively, first by 1 minute, then 10 minutes, and then an hour, so it accumulates a failure count of 6 but then waits long enough that the next attempt succeeds.

On 0.2.9.4-alpha, it looks like the value of next_attempt_at is increased more slowly, by only seconds at a time, so it reattempts every minute and quickly reaches the failure limit.
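
To make the double-counting concrete, here is a minimal stand-alone model of the sequence above (names and structure are illustrative, not the actual Tor functions):

/* Toy model of the failure accounting described above; illustrative only. */
typedef struct {
  int n_download_failures;
} dl_status_model_t;

#define MAX_IN_PROGRESS_CONNS 1

static void
increment_failure_model(dl_status_model_t *dls)
{
  dls->n_download_failures++;
}

/* One "minute tick" as reported: the pending request comes back 304, and the
 * follow-up attempt is skipped because a connection already counts as in
 * progress, so each 304 costs two failures and the limit of 8 is crossed
 * after five ticks. */
static void
one_minute_tick(dl_status_model_t *dls, int got_304, int conns_in_progress)
{
  if (got_304)
    increment_failure_model(dls);   /* corresponds to status_code == 304 */
  if (conns_in_progress >= MAX_IN_PROGRESS_CONNS)
    increment_failure_model(dls);   /* corresponds to status_code == 0 */
}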

Child Tickets

#20533 (closed): Each download request should only increment the failure count once (Core Tor/Tor)
#20536 (closed): Should tor keep on retrying, even if it has reached the failure limit? (Core Tor/Tor)
#20591 (closed): Ensure relays don't make multiple connections during bootstrap (Core Tor/Tor)
#20593 (closed): Avoid resetting download status on 503 (Core Tor/Tor)

Change History (37)

comment:1 Changed 10 months ago by dgoulet

Keywords: regression added
Milestone: Tor: 0.2.9.x-final
Priority: Medium → High

comment:2 Changed 10 months ago by arma

This looks very similar to my "mystery 1" in #19969 -- so I am going to bring that discussion over here so we can keep the bugs separate.

For me it happened on "Tor 0.2.9.3-alpha-dev (git-bfaded9143d127cb)" (which is alas not in a released version of Tor, because #20269 isn't merged yet, but suffice to say it's partway between 0.2.9.3 and 0.2.9.4). And this was just a client, not a bridge or relay.

teor asked in #19969 what my "consensus download_status_t has for all its fields, particularly the attempt and failure counts." Here is what gdb says:

(gdb) print consensus_dl_status
$8 = {{next_attempt_at = 0, n_download_failures = 0 '\000',
    n_download_attempts = 0 '\000', schedule = DL_SCHED_CONSENSUS,
    want_authority = DL_WANT_ANY_DIRSERVER,
    increment_on = DL_SCHED_INCREMENT_FAILURE,
    backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 0 '\000',
    last_delay_used = 0}, {next_attempt_at = 1477296555,
    n_download_failures = 17 '\021', n_download_attempts = 17 '\021',
    schedule = DL_SCHED_CONSENSUS, want_authority = DL_WANT_ANY_DIRSERVER,
    increment_on = DL_SCHED_INCREMENT_FAILURE,
    backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 17 '\021',
    last_delay_used = 387}}

The first half of that makes sense, since my Tor doesn't touch the non-microdesc consensus stuff. But for the second half... it looks like my next_attempt_at is ~6 days ago.

comment:3 Changed 10 months ago by arma

teor also asked if my Tor has marked each of the directory authorities down. I believe the answer is yes -- all the entries in trusted_dir_servers have is_running set to 0, except the Bifroest entry (which makes sense). Here is an example, to be thorough:

(gdb) print *(dir_server_t *)(trusted_dir_servers->list[0])
$16 = {
  description = 0x7f57f56c0770 "directory server \"moria1\" at 128.31.0.39:9131"
, nickname = 0x7f57f56c0750 "moria1",
  address = 0x7f57f56c0600 "128.31.0.39", ipv6_addr = {family = 0, addr = {
      dummy_ = 0, in_addr = {s_addr = 0}, in6_addr = {__in6_u = {
          __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0,
            0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, addr = 2149515303,
  dir_port = 9131, or_port = 9101, ipv6_orport = 0, weight = 1,
  digest = "\226\225\337\303_\376\270a2\233\237\032\260LF9p \316\061",
  v3_identity_digest = "Õ<86>Ñ<83>\t\336\324\315mW\301\217Û<97>\357\251m3\005f",
  is_running = 0, is_authority = 1, has_accepted_serverdesc = 0,
  type = (V3_DIRINFO | EXTRAINFO_DIRINFO | MICRODESC_DIRINFO), 
  addr_current_at = 1448282868, fake_status = {published_on = 0, 
    nickname = "moria1", '\000' <repeats 13 times>, 
    identity_digest = "\226\225\337\303_\376\270a2\233\237\032\260LF9p \316\061", descriptor_digest = '\000' <repeats 31 times>, addr = 2149515303, 
    or_port = 9101, dir_port = 9131, ipv6_addr = {family = 0, addr = {
        dummy_ = 0, in_addr = {s_addr = 0}, in6_addr = {__in6_u = {
            __u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 
              0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}}}, 
    ipv6_orport = 0, is_authority = 0, is_exit = 0, is_stable = 0, 
    is_fast = 0, is_flagged_running = 0, is_named = 0, is_unnamed = 0, 
    is_valid = 0, is_possible_guard = 0, is_bad_exit = 0, is_hs_dir = 0,
    is_v2_dir = 0, protocols_known = 0, supports_extend2_cells = 0,
    has_bandwidth = 0, has_exitsummary = 0, bw_is_unmeasured = 0,
    bandwidth_kb = 0, has_guardfraction = 0, guardfraction_percentage = 0,
    exitsummary = 0x0, last_dir_503_at = 0, dl_status = {next_attempt_at = 0,
      n_download_failures = 0 '\000', n_download_attempts = 0 '\000',
      schedule = DL_SCHED_GENERIC, want_authority = DL_WANT_ANY_DIRSERVER,
      increment_on = DL_SCHED_INCREMENT_FAILURE,
      backoff = DL_SCHED_DETERMINISTIC, last_backoff_position = 0 '\000',
      last_delay_used = 0}}}

comment:4 Changed 10 months ago by arma

teor also asked if my tor has marked all the fallback dirs down.

The fallback_dir_servers smartlist has 90 elements so I didn't check all of them, but for the first few, is_running was set to 1. That is, no, my Tor seemed to think that the fallback dirs were just fine to contact. It simply chose not to contact them.

comment:5 Changed 10 months ago by nickm

I wonder if the new random_exponential_backoff code in 0.2.9 could be at issue.

comment:6 Changed 10 months ago by arma

Summary: Relay/Bridge won't update the microdesc consensus while running → A running Tor won't update the microdesc consensus

(changing the title since this happened to my tor client too)

comment:7 Changed 10 months ago by arma

To give some other context to folks reading this ticket, here is some more debugging detail (all from my Tor client that has been opting not to retrieve a new consensus for the past week+):

I set a breakpoint on fetch_networkstatus_callback, and learned that prefer_mirrors is 1, and we_are_bootstrapping is 1. should_delay_dir_fetches() was 0. It called update_networkstatus_downloads as expected.

Then I set a breakpoint on update_consensus_networkstatus_downloads. Again we_are_bootstrapping is 1. use_multi_conn alas is <optimized out>, but looking at the function, I assume it's 1 for me. The first time through the loop, for the vanilla consensus flavor, we_want_to_fetch_flavor() is no, so I move on to the second pass. I don't know how c = networkstatus_get_latest_consensus_by_flavor(i); goes, because "print c" also says <optimized out>, but it looks like it runs time_to_download_next_consensus[i] = now; next, so I can assume that c was NULL. Then it keeps going through the function until it calls update_consensus_bootstrap_multiple_downloads(). Looks plausible.

So I set a breakpoint on update_consensus_bootstrap_multiple_downloads, which was trickier than I would have wanted since it looks like my compiler inlined it into update_consensus_networkstatus_downloads. But it looks like it does make two calls to update_consensus_bootstrap_attempt_downloads -- one with dls_f, and the next with dls_a.

So I set a breakpoint on update_consensus_bootstrap_attempt_downloads. It sets max_dl_tries to 7, which makes sense since I see it there in config.c, set to default to 7.

Now it gets interesting:

(gdb) print *dls
$3 = {next_attempt_at = 1477295479, n_download_failures = 0 '\000', 
  n_download_attempts = 8 '\b', schedule = DL_SCHED_CONSENSUS, 
  want_authority = DL_WANT_ANY_DIRSERVER, 
  increment_on = DL_SCHED_INCREMENT_ATTEMPT, 
  backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 8 '\b', 
  last_delay_used = 7}

n_download_attempts is 8, and max_dl_tries is 7. I wonder where this is going!

Turns out download_status_is_ready() is no fun to gdb in because it's an inline tucked into directory.h, but looking at the code, it seems clear it returns 0, i.e. not ready. Then the function exits.

Then the same thing happens with the update_consensus_bootstrap_attempt_downloads call that was for dls_a.

Conclusion, I somehow failed 8 times to get a consensus, and now I will never try again.
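
In other words, the check seems to reduce to something like this model (a sketch only, not the actual inline from directory.h):

#include <time.h>

/* With n_download_attempts already at 8 and max_dl_tries at 7, this can
 * never return 1 again, no matter how stale the consensus gets. */
static int
download_is_ready_model(int n_attempts, int n_failures, int max_tries,
                        time_t next_attempt_at, time_t now)
{
  if (n_attempts >= max_tries)
    return 0;
  if (n_failures >= max_tries)
    return 0;
  /* Otherwise, wait until the schedule says it is time. */
  return next_attempt_at <= now;
}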

comment:8 Changed 10 months ago by arma

Looking back at rubiate's analysis, my debugging matches theirs.

comment:9 Changed 10 months ago by arma

For thoroughness,

(gdb) print consensus_bootstrap_dl_status
$6 = {{next_attempt_at = 1477296321, n_download_failures = 0 '\000', 
    n_download_attempts = 8 '\b', schedule = DL_SCHED_CONSENSUS, 
    want_authority = DL_WANT_AUTHORITY, 
    increment_on = DL_SCHED_INCREMENT_ATTEMPT, 
    backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 8 '\b', 
    last_delay_used = 273}, {next_attempt_at = 1477295479, 
    n_download_failures = 0 '\000', n_download_attempts = 8 '\b', 
    schedule = DL_SCHED_CONSENSUS, want_authority = DL_WANT_ANY_DIRSERVER, 
    increment_on = DL_SCHED_INCREMENT_ATTEMPT, 
    backoff = DL_SCHED_RANDOM_EXPONENTIAL, last_backoff_position = 8 '\b', 
    last_delay_used = 7}}

comment:10 Changed 10 months ago by arma

#8625 has this choice quote: "I say we merge and wait to see if we get bugs reported?"

comment:11 Changed 10 months ago by arma

Easy answer is to revert 09a0f2d0b24 until we've designed a better download schedule and a better mechanism for testing it?

comment:12 Changed 10 months ago by arma

rubiate, are you able to reproduce this bug consistently? If so, can you spin up a relay or bridge with commit 09a0f2d0b24 reverted, and see how that fares? My guess is that it will fare much better.

In the meantime, I've opened #20501 to look at the Tor network for relays that were bitten by this bug (seems like quite a few), and #20509 for doing something about getting them off the network and/or taking away their Guard flag so clients don't get stuck behind them and end up unable to use Tor.

comment:13 Changed 10 months ago by nickm

I concur that for 0.2.9, reverting 09a0f2d0b seems like a good choice. If we feel cheeky, we might increase the interval, though...

comment:14 Changed 10 months ago by rubiate

I set up two new relays:

bug20499bad ED6E43F07ABE87C017BD80D4BA24E41F8FF32E94
bug20499revert 083971FD18EDBD442DF0971D0FDC6F500642AD91

'bad is running 0.2.9.4-alpha, and 'revert is identical except it has 09a0f2d0b24 reverted. Both have the same config, other than cosmetic differences (different names, ports, logfile).

Rough timeline:

10:36 Both start up
[...] both request the consensus every minute
10:41 they reach a fail count of 10
[...] both do nothing every minute, fail count is too high
11:36 'revert updates the microdesc consensus; valid-until 2016-11-01 14:00:00
11:36 'bad is still doing nothing, valid-until 2016-11-01 13:00:00
11:37 'revert tries to do it again, gets a 304, fail count is 2
11:38 'revert tries to do it again, gets a 304, fail count is 4
[...] and so on
11:41 'revert reaches a fail count of 10 again
[...] both do nothing
12:36 'revert updates; valid-until 2016-11-01 15:00:00
12:36 'bad doesn't; valid-until 2016-11-01 13:00:00
12:41 'revert hits 10 fails again
[...] doin' nothin'
13:00 'bad now has a microdesc consensus past its valid-until date
[...]
13:36 'revert updates; valid-until 2016-11-01 16:00:00
13:36 'bad doesn't; valid-until 2016-11-01 13:00:00
[...] I suspect this pattern is going to hold.

'revert:
http://45.32.188.229:9209/tor/status-vote/current/consensus-microdesc

'bad:
http://45.32.188.229:9201/tor/status-vote/current/consensus-microdesc

comment:15 Changed 10 months ago by teor

Perhaps max_dl_tries is also far too low, and we should increase it somewhat? 10?

comment:16 Changed 10 months ago by nickm

So, if we're truly on exponential backoff, no maximum could be too large, right?

I also wonder, why are these failure counts so high?

comment:17 in reply to:  16 ; Changed 10 months ago by teor

Replying to nickm:

So, if we're truly on exponential backoff, no maximum could be too large, right?

Technically, yes.

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?
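
For a sense of scale, a toy doubling model (not Tor's actual randomized backoff) shows how quickly "retry later" becomes indistinguishable from "never":

#include <stdio.h>

int
main(void)
{
  /* Assume the delay roughly doubles per failure from a 60-second base; the
   * real code uses a randomized exponential step, so exact values differ. */
  double delay = 60.0;
  for (int failures = 1; failures <= 20; failures++) {
    printf("after %2d failures: next delay ~ %7.1f days\n",
           failures, delay / 86400.0);
    delay *= 2.0;
  }
  return 0;
}

By this model the delay grows from a minute to roughly a year within about twenty failures.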

I also wonder, why are these failure counts so high?

Firstly, because they get incremented twice for each failure.

download_status_increment_failure() gets called with a status_code of 304

update_consensus_networkstatus_downloads() gets called again, this time it stops at the call to connection_dir_count_by_purpose_and_resource() which returns 1 (equal to max_in_progress_conns)

download_status_increment_failure() gets called again, this time with a status_code of 0 (as a result each 304 response results in the fail count being increased by 2)

And secondly, because the laptop was offline for 12? hours?

comment:18 Changed 10 months ago by teor

Oh, and because we have bad directory mirrors serving out of date consensuses repeatedly?

comment:19 in reply to:  14 ; Changed 10 months ago by arma

Replying to rubiate:

I set up two new relays:
[...]
Rough timeline:

10:36 Both start up
[...] both request the consensus every minute
10:41 they reach a fail count of 10

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Are you firewalled in some weird way? Are they trying to fetch it from fallbackdirs and those are surprisingly faily? Are they trying from directory authorities and our authorities are no good?

Anyway, it looks like 'revert' is the winner, but it would still be great to learn what is so helpful about your test environment that it triggers this bug so well.


comment:20 in reply to:  19 Changed 10 months ago by teor

Replying to arma:

Replying to rubiate:

I set up two new relays:
[...]
Rough timeline:

10:36 Both start up
[...] both request the consensus every minute
10:41 they reach a fail count of 10

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Are you firewalled in some weird way? Are they trying to fetch it from fallbackdirs and those are surprisingly faily? Are they trying from directory authorities and our authorities are no good?

Anyway, it looks like 'revert' is the winner, but it would still be great to learn what is so helpful about your test environment that it triggers this bug so well.

Well, they're in Australia, so latency is high, and measured bandwidth is low. But I'm not sure that would cause so many failures. Maybe something with OpenBSD?

My relay at the same provider has AccountingMax set, and has disabled its DirPort, so it's much harder to interrogate. It's on FreeBSD, but on 0.2.8.7 (still waiting for a package update), and up to date with its consensuses.

comment:21 in reply to:  17 ; Changed 10 months ago by arma

Replying to teor:

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

Firstly, because they get incremented twice for each failure.

I haven't looked into that one, but if so, can we open a new ticket for this (what looks like separate) bug?

And secondly, because the laptop was offline for 12? hours?

Actually, I think I drove my consensus download failure count up to 8 over the course of about ten minutes -- it launches each new try within a second of when the last one failed:

Oct 24 03:51:12.713 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap mirror networkstatus consensus download.
Oct 24 03:51:12.713 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:51:42.725 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:52:36.747 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:54:14.787 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:57:19.855 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 04:00:48.938 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.

My laptop was closed (asleep) for more than a day, so when it woke up its consensus was more than 24 hours old, so it immediately jumped to bootstrap mode for its downloads. Ten minutes later, it had given up permanently.

comment:22 in reply to:  21 Changed 10 months ago by teor

Replying to arma:

Replying to teor:

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

Proposal 210 is close, but it's been modified by at least you, me, and andrea since then.

Firstly, because they get incremented twice for each failure.

I haven't looked into that one, but if so, can we open a new ticket for this (what looks like separate) bug?

#20533

And secondly, because the laptop was offline for 12? hours?

Actually, I think I drove my consensus download failure count up to 8 over the course of about ten minutes -- it launches each new try within a second of when the last one failed:

Oct 24 03:51:12.713 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap mirror networkstatus consensus download.
Oct 24 03:51:12.713 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:51:42.725 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:52:36.747 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:54:14.787 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 03:57:19.855 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.
Oct 24 04:00:48.938 [info] update_consensus_bootstrap_attempt_downloads(): Launching microdesc bootstrap authority networkstatus consensus download.

My laptop was closed (asleep) for more than a day, so when it woke up its consensus was more than 24 hours old, so it immediately jumped to bootstrap mode for its downloads. Ten minutes later, it had given up permanently.

That timing is off from what I would expect - when I designed it, it was:
Fallbacks: 0, 1, 5, 16, 3600, ...
Authorities: 10, 21, 3600, ...

But if we're skipping two on every failure, it could become:
Fallbacks: 0, (1 or 5 depending on exact failure timing), 3600, ...
Authorities: 10, 3600, ...
And if the client has all the authorities as down, I guess it won't even try them.
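
A hedged sketch of that skipping effect, indexing the fallback schedule quoted above by a counter that gets bumped twice per real failure (values not checked against the source):

#include <stdio.h>

static const int fallback_schedule[] = { 0, 1, 5, 16, 3600 };
#define N_ENTRIES ((int)(sizeof(fallback_schedule) / sizeof(fallback_schedule[0])))

/* Clamp to the last entry, as a schedule lookup typically would. */
static int
delay_for(int count)
{
  return fallback_schedule[count < N_ENTRIES ? count : N_ENTRIES - 1];
}

int
main(void)
{
  for (int failures = 0; failures < N_ENTRIES; failures++)
    printf("failure %d: intended delay %4ds, double-counted delay %4ds\n",
           failures, delay_for(failures), delay_for(2 * failures));
  return 0;
}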

comment:23 in reply to:  21 Changed 10 months ago by teor

Replying to arma:

Replying to teor:

But at some exponent, the wait time becomes indistinguishable from failure.
(Which is why we need to make sure requests trigger a new attempt.)

It is good that we have the belt-and-suspenders fix in place where new client requests trigger a new attempt -- but that trick only works for clients. We should make sure that directory mirrors also have some way to reliably keep trying, and same for exit relays because of the should_refuse_unknown_exits() thing. Basically all of the reasons in directory_fetches_from_authorities().

I guess this essentially implements hibernate mode then?

And we could just put the failure count up to something quite high, let's say, at most, the failure number at which tor is waiting for the average time between tor stable releases?

It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here. Speaking of which, is there a place I should look to read about our current download design? I only know the one I wrote some years ago, and it looks like it's changed since then.

I logged #20534 for this. There are existing cases where we give up forever. We should tune them to do what we think we want.

comment:24 in reply to:  19 ; Changed 10 months ago by rubiate

Replying to arma:

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Ah, no, they both got a perfectly good consensus at startup. The "failures" every minute after that are from them having a fresh consensus, requesting a new one anyway and getting 304 Not Modified in response.

comment:25 in reply to:  24 Changed 10 months ago by teor

Replying to rubiate:

Replying to arma:

What is wrong with your relay set-up such that they both fail to get a consensus at bootstrap? :)

Ah, no, they both got a perfectly good consensus at startup. The "failures" every minute after that are from them having a fresh consensus, requesting a new one anyway and getting 304 Not Modified in response.

So your reverted relay is bad: it retries 5 times every hour, when it should only try once an hour.
And your bad relay is also bad: it retries never, when it should at least try once an hour.

I opened #20535 for this.

comment:26 Changed 10 months ago by teor

Ok, and finally, for the behaviour at the failure limit, I opened #20536.

Some of these will also interact with #19969; we should consider both tickets when doing the fixes.

comment:27 Changed 9 months ago by nickm

What's the current consensus on what the minimal set of fixes for 029 is here? I'd like to do something in the next couple of days for an 0.2.9.5-alpha, even if we expect that we'll have to fine-tune it a little more for 0.2.9.6-rc.

Do we need to do all of the subtickets in 0.2.9, do you think? Or do we take a simpler approach?

comment:28 Changed 9 months ago by nickm

I have a branch bug20499_part1_029 that I think should be sufficient to make 0.2.9 work correctly here. It solves #20534 and #20536 (and #20587 too, since it was there). The other stuff, I think, could wait until 0.3.0? My understanding is limited though.

comment:29 Changed 9 months ago by nickm

Status: new → needs_review

comment:30 in reply to:  21 ; Changed 9 months ago by teor

Code review of 20499_part1_029:

I'm surprised your compiler didn't warn about the assignment here:

if (dls->backoff = DL_SCHED_DETERMINISTIC)
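
Presumably the intended condition is an equality test; a minimal stand-alone stand-in (the type and enum names here are placeholders, not Tor's definitions) showing the shape of the fix:

/* '==' where the reviewed line had '='. With -Wall (-Wparentheses), GCC and
 * clang warn when a bare assignment is used as a condition, which is why the
 * missing warning is surprising. */
typedef enum { BACKOFF_DETERMINISTIC, BACKOFF_RANDOM_EXPONENTIAL } backoff_model_t;
typedef struct { backoff_model_t backoff; } backoff_state_model_t;

static int
backoff_is_deterministic(const backoff_state_model_t *st)
{
  return st->backoff == BACKOFF_DETERMINISTIC;
}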

Replying to arma:

...
It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here.

We should think carefully about what we want the default maximum to be, and if we want it to be the same in every case:

*max = INT_MAX;

Perhaps the network could deal with slow zombies if they only retried every day/week/month. Or perhaps we really do want them to stop asking. Or perhaps we want clients to do one thing, and relays another.

Currently, some schedules do have INT_MAX as their maximum, others have 2, 12, or 73 hours.

All the other commits look fine, but I'm not sure they're enough to solve this issue.

comment:31 Changed 9 months ago by teor

Keywords: CoreTorTeam201611 added
Status: needs_review → needs_revision

As well as the commits in 20499_part1_029, I think we should also merge #20533 to fix the one-download-multiple-failure case.

We could also merge #20591; I don't think it's causing this particular issue, but it could make it worse if the bug is ever triggered.

If we really do want every failure to result in a schedule increment, we have to remove the following code:

download_status_increment_failure:

  /* only count the failure if it's permanent, or we're a server */
  if (status_code != 503 || server) {
    if (dls->n_download_failures < IMPOSSIBLE_TO_DOWNLOAD-1)
      ++dls->n_download_failures;
  }

Because combined with this code in download_status_schedule_get_delay, it causes a reset on every 503:

    /* Check if we missed a reset somehow */
    if (dls->last_backoff_position > dls_schedule_position) {
      dls->last_backoff_position = 0;
      dls->last_delay_used = 0;
    }

Which is exactly what we don't want when relays are busy - imagine clients doing an automatic reset every time they DoS a relay...

comment:32 in reply to:  30 Changed 9 months ago by nickm

Replying to teor:

Code review of 20499_part1_029:

I'm surprised your compiler didn't warn about the assignment here:

if (dls->backoff = DL_SCHED_DETERMINISTIC)

Yikes. Added a fixup commit for that.

Replying to arma:

...
It seems to me that any design that effectively has a "now you won't ask for the consensus anymore" possible outcome is a scary one here.

We should think carefully about what we want the default maximum to be, and if we want it to be the same in every case:

*max = INT_MAX;

Perhaps the network could deal with slow zombies if they only retried every day/week/month. Or perhaps we really do want them to stop asking. Or perhaps we want clients to do one thing, and relays another.

Currently, some schedules do have INT_MAX as their maximum, others have 2, 12, or 73 hours.

My thinking here was that we're backing off to infinite retries, so we might as well let the delays get arbitrarily large. We might want to cut the maximum back down if it turns out to be a problem in practice, but I'd like to be cautious about zombies for now.

All the other commits look fine, but I'm not sure they're enough to solve this issue.

Me neither!

comment:33 Changed 9 months ago by nickm

Squashing and merging my patch above along with #20591 and #20533.

comment:34 in reply to:  31 Changed 9 months ago by nickm

Owner: set to nickm
Status: needs_revision → accepted

Added #20593 for the reset-on-503 issue above.

comment:35 Changed 9 months ago by teor

I have sketched out some minor tweaks to the exponential backoff parameters in #20534. This makes the download timings much more like 0.2.8.

comment:36 Changed 9 months ago by nickm

Okay. We've merged a fairly large passel of fixes to try to address this. It is reasonably likely that we will need more fine-tuning, so this will be one of the things that makes 0.2.9.5-alpha an alpha release.

comment:37 Changed 9 months ago by teor

Parent ID: #20534
Resolution: fixed
Status: accepted → closed

Fixed, reparenting to #20534 for documentation
