In #4483 (moved) and prop210 we set up an elaborate download schedule for consistently reaching fallbackdirs when fetching the consensus, so we don't end up just sitting there for 120 seconds while a tcp connection waits (and eventually the SocksTimeout parameter is reached and we move on).
But we didn't do any similar thing with fetching the key certs. I just had my bootstrap go smoothly through the #4483 (moved) features (with the fixes from #18809 (moved)) and then it stalled for 2 minutes trying to fetch the certs from a fallbackdir that's offline.
Sure enough, in authority_certs_fetch_missing() I see
/* XXX - do we want certs from authorities or mirrors? - teor */ directory_get_from_dirserver(DIR_PURPOSE_FETCH_CERTIFICATE, 0, resource, PDS_RETRY_IF_NO_SERVERS, DL_WANT_ANY_DIRSERVER);
So teor noticed this one too.
I think in 0.2.8, if we leave the fallbackdir stuff in (meaning we merge #18809 (moved) or equivalent into 0.2.8), we could bandage this one by changing DL_WANT_ANY_DIRSERVER to DL_WANT_AUTHORITY, and then it wouldn't be much worse than it is now (in terms of performance -- we would indeed lose the ability to bootstrap from scratch when the authorities are unavailable).
Longer term (0.2.9 and later), I think we should explore a) having directory_get_from_dirserver() notice that there are tls conns established to dir mirrors that we just recently used (and prefer them), or b) trying to explicitly remember the dir mirror that gave us the consensus and re-use it, and/or c) designing a piggy-back mechanism so we can ask for "the certs that go with this consensus" when we're fetching a consensus and we know we will want the certs for it too (thus saving a round-trip).
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
In #4483 (moved) and prop210 we set up an elaborate download schedule for consistently reaching fallbackdirs when fetching the consensus, so we don't end up just sitting there for 120 seconds while a tcp connection waits (and eventually the SocksTimeout parameter is reached and we move on).
But we didn't do any similar thing with fetching the key certs. I just had my bootstrap go smoothly through the #4483 (moved) features (with the fixes from #18809 (moved)) and then it stalled for 2 minutes trying to fetch the certs from a fallbackdir that's offline.
Sure enough, in authority_certs_fetch_missing() I see
{{{
/* XXX - do we want certs from authorities or mirrors? - teor */
directory_get_from_dirserver(DIR_PURPOSE_FETCH_CERTIFICATE, 0,
resource, PDS_RETRY_IF_NO_SERVERS,
DL_WANT_ANY_DIRSERVER);
}}}
So teor noticed this one too.
I think in 0.2.8, if we leave the fallbackdir stuff in (meaning we merge #18809 (moved) or equivalent into 0.2.8), we could bandage this one by changing DL_WANT_ANY_DIRSERVER to DL_WANT_AUTHORITY, and then it wouldn't be much worse than it is now (in terms of performance -- we would indeed lose the ability to bootstrap from scratch when the authorities are unavailable).
This could be caused by an existing bug in download_status_reset_by_sk_in_cl, download_status_is_ready_by_sk_in_cl, and get_cert_list, where we don't consistently initialise the schedule.
Some certificate fetches get the generic schedule, others get the consensus schedule.
Some certificate fetches initialise the schedule, others don't bother.
With this fix, tor will use ClientBootstrapConsensusFallbackDownloadSchedule for certificate fetches, which starts 0, 1, 4, 11, yielding 4 tries for each certificate in the first 16 seconds. The limit TestingCertMaxDownloadTries is 8, which we'll never reach in any reasonable time.
This is still not great if a significant number of fallbacks are blocked or blackholed.
So let's try an authority to start with, then try a fallback if the authority fails (this is better than 0.2.7, but still has issues with blackholed authorities and fallbacks).
87fdbb6 Switch between fallback and authority when auth cert fetch fails
bea0819 fixup! Switch between fallback and authority when auth cert fetch fails
(Try an authority first, because they're more likely to work first time than a fallback.)
There are a few options to deal with blackholed fallbacks and authorities:
do what we do with the consensus, and try multiple, simultaneous connections to both authorities and fallback directories, use the first one that succeeds, and close the rest,
if the connection to a fallback fails, try an authority (this still doesn't help with blackholed fallbacks),
or any of the other options arma mentions:
Longer term (0.2.9 and later), I think we should explore a) having directory_get_from_dirserver() notice that there are tls conns established to dir mirrors that we just recently used (and prefer them), or b) trying to explicitly remember the dir mirror that gave us the consensus and re-use it, and/or c) designing a piggy-back mechanism so we can ask for "the certs that go with this consensus" when we're fetching a consensus and we know we will want the certs for it too (thus saving a round-trip).
I've split these off into #18963 (moved) so we can deal with them, maybe in 0.2.9.
Trac: Status: new to needs_review Points: medium to small
looks fine, though the comment was a little unclear to me. I'll fix that though. nickm -- clarify the comment on download_status_cert_init to clarify that download_status_cert_init() is not okay to call after the first failure has occurred.
87fdbb6 and bea0819 --
I'm not sure about these, because they bring back the presence of authorities as client enumerators. With these patches, since every fallback will go down sometimes, every client will try an authority sometimes, and the authorities will still be able to enumerate them. What if, as an alternative solution, we just use a sdifferent schedule for certificate downloads? Are they still really necessary with f2e9af1 ?
looks fine, though the comment was a little unclear to me. I'll fix that though. nickm -- clarify the comment on download_status_cert_init to clarify that download_status_cert_init() is not okay to call after the first failure has occurred.
87fdbb6 and bea0819 --
I'm not sure about these, because they bring back the presence of authorities as client enumerators. With these patches, since every fallback will go down sometimes, every client will try an authority sometimes, and the authorities will still be able to enumerate them. What if, as an alternative solution, we just use a sdifferent schedule for certificate downloads? Are they still really necessary with f2e9af1 ?
I guess I worry about a situation where all the fallbacks are blocked, but the authorities are not. Then Tor 0.2.8 won't be able to bootstrap reliably, where 0.2.7 could. (I know this is rare.)
And even if fallbacks don't go down, the authorities are still present in the list of fallbacks. With the current 100 fallbacks at weight 10 and 9 authorities at weight 1, clients will try authorities 9 / 1009 ~= 1% of the time.
That said, these certificate fetches only happen rarely - once on initial client bootstrap, and once each time a certificate expires. If a client has a valid consensus when a certificate expires, it fetches the certificates from a random directory in the consensus.
So the overall chance of any client contacting an authority is:
1% on initial bootstrap,
1% each time its consensus stops being reasonably live (if tor isn't run for 72 hours)
on authority certificate expiry:
1% if the consensus is not reasonably live
vanishingly small with a live consensus (the authorities I checked have consensus weight 20)
We can do a few things to fix this:
don't fall back to an authority, always use fallbacks (revert 87fdbb6 and bea0819)
fall back to a fallback first, then an authority (revert bea0819)
merge #18963 (moved) to re-use the same directory that gave us the consensus for certificates
weight the fallbacks higher (but equally) so the authorities see less than 1% of traffic
I like a combination of:
fall back to an authority only after trying the fallback that gave you the consensus (#18963 (moved)), and another fallback (merge #18963 (moved), revert bea0819)
decide how much traffic we want the authorities to get, and weight the fallbacks accordingly
I just merged this under the impression that arma and teor had talked about my open question wrt the static variable. But arma and teor are talking about it now. This might need an additional tweak.
Please see my branch bug18816_simplify in which I revert "Switch between fallback and authority when auth cert fetch fails", and change the comments to confirm that's what we want.
arma says this is unnecessary complexity, because after #18963 (moved), failing to fetch certificates only occurs when a connection fails or tor quits after a downloading a consensus but before downloading enough certificates, and it is a transient issue that will go away with the next successful certificate or consensus fetch.