Opened 4 years ago

Closed 4 years ago

#18816 closed defect (fixed)

We still wait 120 seconds for cert fetches from missing dir mirrors

Reported by: arma Owned by:
Priority: Medium Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor Version: Tor: 0.2.8.1-alpha
Severity: Normal Keywords: must-fix-before-028-rc, CoreTorTeam201605, review-group-1
Cc: Actual Points:
Parent ID: Points: small
Reviewer: nickm Sponsor:

Description

In #4483 and prop210 we set up an elaborate download schedule for consistently reaching fallbackdirs when fetching the consensus, so we don't end up just sitting there for 120 seconds while a tcp connection waits (and eventually the SocksTimeout parameter is reached and we move on).

But we didn't do any similar thing with fetching the key certs. I just had my bootstrap go smoothly through the #4483 features (with the fixes from #18809) and then it stalled for 2 minutes trying to fetch the certs from a fallbackdir that's offline.

Sure enough, in authority_certs_fetch_missing() I see

      /* XXX - do we want certs from authorities or mirrors? - teor */
      directory_get_from_dirserver(DIR_PURPOSE_FETCH_CERTIFICATE, 0,
                                   resource, PDS_RETRY_IF_NO_SERVERS,
                                   DL_WANT_ANY_DIRSERVER);

So teor noticed this one too.

I think in 0.2.8, if we leave the fallbackdir stuff in (meaning we merge #18809 or equivalent into 0.2.8), we could bandage this one by changing DL_WANT_ANY_DIRSERVER to DL_WANT_AUTHORITY, and then it wouldn't be much worse than it is now (in terms of performance -- we would indeed lose the ability to bootstrap from scratch when the authorities are unavailable).

Longer term (0.2.9 and later), I think we should explore a) having directory_get_from_dirserver() notice that there are tls conns established to dir mirrors that we just recently used (and prefer them), or b) trying to explicitly remember the dir mirror that gave us the consensus and re-use it, and/or c) designing a piggy-back mechanism so we can ask for "the certs that go with this consensus" when we're fetching a consensus and we know we will want the certs for it too (thus saving a round-trip).

Child Tickets

Change History (18)

comment:1 Changed 4 years ago by nickm

Points: medium

comment:2 Changed 4 years ago by nickm

Keywords: 029-nickm-unsure added

Marking these tickets as the ones I think I need more feedback about in order to figure out if I think it should go in 0.2.9.

comment:3 Changed 4 years ago by teor

Keywords: must-fix-before-028-rc added
Milestone: Tor: 0.2.???Tor: 0.2.8.x-final
Version: Tor: 0.2.8.1-alpha

We have to fix at least part of this in 0.2.8, or the fallback feature will break as soon as a fallback is down.

comment:4 in reply to:  description Changed 4 years ago by teor

Points: mediumsmall
Status: newneeds_review

Replying to arma:

In #4483 and prop210 we set up an elaborate download schedule for consistently reaching fallbackdirs when fetching the consensus, so we don't end up just sitting there for 120 seconds while a tcp connection waits (and eventually the SocksTimeout parameter is reached and we move on).

But we didn't do any similar thing with fetching the key certs. I just had my bootstrap go smoothly through the #4483 features (with the fixes from #18809) and then it stalled for 2 minutes trying to fetch the certs from a fallbackdir that's offline.

Sure enough, in authority_certs_fetch_missing() I see

      /* XXX - do we want certs from authorities or mirrors? - teor */
      directory_get_from_dirserver(DIR_PURPOSE_FETCH_CERTIFICATE, 0,
                                   resource, PDS_RETRY_IF_NO_SERVERS,
                                   DL_WANT_ANY_DIRSERVER);

So teor noticed this one too.

I think in 0.2.8, if we leave the fallbackdir stuff in (meaning we merge #18809 or equivalent into 0.2.8), we could bandage this one by changing DL_WANT_ANY_DIRSERVER to DL_WANT_AUTHORITY, and then it wouldn't be much worse than it is now (in terms of performance -- we would indeed lose the ability to bootstrap from scratch when the authorities are unavailable).

This could be caused by an existing bug in download_status_reset_by_sk_in_cl, download_status_is_ready_by_sk_in_cl, and get_cert_list, where we don't consistently initialise the schedule.
Some certificate fetches get the generic schedule, others get the consensus schedule.
Some certificate fetches initialise the schedule, others don't bother.

See f2e9af1 Use the consensus download schedule for authority certificates in my branch bug18816 on https://github.com/teor2345/tor.git

With this fix, tor will use ClientBootstrapConsensusFallbackDownloadSchedule for certificate fetches, which starts 0, 1, 4, 11, yielding 4 tries for each certificate in the first 16 seconds. The limit TestingCertMaxDownloadTries is 8, which we'll never reach in any reasonable time.

This is still not great if a significant number of fallbacks are blocked or blackholed.

So let's try an authority to start with, then try a fallback if the authority fails (this is better than 0.2.7, but still has issues with blackholed authorities and fallbacks).

87fdbb6 Switch between fallback and authority when auth cert fetch fails
bea0819 fixup! Switch between fallback and authority when auth cert fetch fails
(Try an authority first, because they're more likely to work first time than a fallback.)

There are a few options to deal with blackholed fallbacks and authorities:

  • do what we do with the consensus, and try multiple, simultaneous connections to both authorities and fallback directories, use the first one that succeeds, and close the rest,
  • if the connection to a fallback fails, try an authority (this still doesn't help with blackholed fallbacks),
  • or any of the other options arma mentions:

Longer term (0.2.9 and later), I think we should explore a) having directory_get_from_dirserver() notice that there are tls conns established to dir mirrors that we just recently used (and prefer them), or b) trying to explicitly remember the dir mirror that gave us the consensus and re-use it, and/or c) designing a piggy-back mechanism so we can ask for "the certs that go with this consensus" when we're fetching a consensus and we know we will want the certs for it too (thus saving a round-trip).

I've split these off into #18963 so we can deal with them, maybe in 0.2.9.

comment:5 Changed 4 years ago by nickm

Keywords: 029-proposed 029-nickm-unsure removed

Since this is now with-code in 0.2.8, I'm yanking it out of 029-proposed entirely.

comment:6 Changed 4 years ago by nickm

Reviewer: nickm

f2e9af1b859287d24201a65213ab6ad5362e9c99 --

  • looks fine, though the comment was a little unclear to me. I'll fix that though. nickm -- clarify the comment on download_status_cert_init to clarify that download_status_cert_init() is _not_ okay to call after the first failure has occurred.

87fdbb6 and bea0819 --

  • I'm not sure about these, because they bring back the presence of authorities as client enumerators. With these patches, since every fallback will go down _sometimes_, every client will try an authority _sometimes_, and the authorities will still be able to enumerate them. What if, as an alternative solution, we just use a sdifferent schedule for certificate downloads? Are they still really necessary with f2e9af1 ?

comment:7 in reply to:  6 Changed 4 years ago by teor

Status: needs_reviewneeds_information

Replying to nickm:

f2e9af1b859287d24201a65213ab6ad5362e9c99 --

  • looks fine, though the comment was a little unclear to me. I'll fix that though. nickm -- clarify the comment on download_status_cert_init to clarify that download_status_cert_init() is _not_ okay to call after the first failure has occurred.

87fdbb6 and bea0819 --

  • I'm not sure about these, because they bring back the presence of authorities as client enumerators. With these patches, since every fallback will go down _sometimes_, every client will try an authority _sometimes_, and the authorities will still be able to enumerate them. What if, as an alternative solution, we just use a sdifferent schedule for certificate downloads? Are they still really necessary with f2e9af1 ?

I guess I worry about a situation where all the fallbacks are blocked, but the authorities are not. Then Tor 0.2.8 won't be able to bootstrap reliably, where 0.2.7 could. (I know this is rare.)

And even if fallbacks don't go down, the authorities are still present in the list of fallbacks. With the current 100 fallbacks at weight 10 and 9 authorities at weight 1, clients will try authorities 9 / 1009 ~= 1% of the time.

That said, these certificate fetches only happen rarely - once on initial client bootstrap, and once each time a certificate expires. If a client has a valid consensus when a certificate expires, it fetches the certificates from a random directory in the consensus.

So the overall chance of any client contacting an authority is:

  • 1% on initial bootstrap,
  • 1% each time its consensus stops being reasonably live (if tor isn't run for 72 hours)
  • on authority certificate expiry:
    • 1% if the consensus is not reasonably live
    • vanishingly small with a live consensus (the authorities I checked have consensus weight 20)

We can do a few things to fix this:

  • don't fall back to an authority, always use fallbacks (revert 87fdbb6 and bea0819)
  • fall back to a fallback first, then an authority (revert bea0819)
  • merge #18963 to re-use the same directory that gave us the consensus for certificates
  • weight the fallbacks higher (but equally) so the authorities see less than 1% of traffic

I like a combination of:

  • fall back to an authority only after trying the fallback that gave you the consensus (#18963), and another fallback (merge #18963, revert bea0819)
  • decide how much traffic we want the authorities to get, and weight the fallbacks accordingly
    • one in a thousand would need a weight of 100
    • one in a million would need a weight of 100,000

Thoughts?

comment:8 Changed 4 years ago by arma

Separate from the short-term fixes here, I filed a long-term improvement as #18987.

comment:9 Changed 4 years ago by teor

Status: needs_informationneeds_review

I need an opinion from nickm and/or arma on this.

comment:10 Changed 4 years ago by teor

Status: needs_reviewneeds_revision

arma says that an authority seeing 0.1% of clients is ok, and much better than 1/9th.
He thinks that we should try:

  • the directory we just fetched from (the hint),
  • a fallback,
  • an authority.

comment:11 Changed 4 years ago by teor

Keywords: CoreTorTeam201605 added
Status: needs_revisionneeds_review

My branch bug18816 on https://github.com/teor2345/tor.git has the following additional commits:

973b245 Revert "fixup! Switch between fallback and authority when auth cert fetch fails"

  • this makes us try the recent successful dir server, then a fallback, then an authority, reducing the number of clients authorities can enumerate
  • fix NM2 in comment 6

b29bc39 Clarify when it is safe to call download_status_cert_init

  • it's only safe to call download_status_cert_init on a dlstatus that hasn't been used for downloads
  • fix NM1 in comment 6 (nickm said he'd fix it, but I was making changes anyway)
  • comment-only change

My branch bug18816-squashed:

  • rebases on the latest maint-0.2.8,
  • removes bea0819 and its matching revert 973b245,
  • squashes b29bc39 (which is really a fixup) into f2e9af1
Last edited 4 years ago by teor (previous) (diff)

comment:12 Changed 4 years ago by nickm

Keywords: review-group-1 added

comment:13 Changed 4 years ago by nickm

I just merged this under the impression that arma and teor had talked about my open question wrt the static variable. But arma and teor are talking about it now. This might need an additional tweak.

comment:14 Changed 4 years ago by nickm

Status: needs_reviewneeds_revision

(I merged bug18816-squashed to maint-0.2.8. More work may be needed depending on what they decide.)

comment:15 Changed 4 years ago by teor

Status: needs_revisionmerge_ready

Please see my branch bug18816_simplify in which I revert "Switch between fallback and authority when auth cert fetch fails", and change the comments to confirm that's what we want.

arma says this is unnecessary complexity, because after #18963, failing to fetch certificates only occurs when a connection fails or tor quits after a downloading a consensus but before downloading enough certificates, and it is a transient issue that will go away with the next successful certificate or consensus fetch.

comment:16 Changed 4 years ago by nickm

Merged bug18816_simplify! We can close once child ticket is closed.

comment:17 Changed 4 years ago by nickm

This ticket keeps confusing me by being open. I'm unparenting the child ticket as it waits for review, then closing this ticket.

comment:18 Changed 4 years ago by nickm

Resolution: fixed
Status: merge_readyclosed
Note: See TracTickets for help on using tickets.