When an authority is down, a client bootstrapping into the network might pick it to learn the consensus, and then have to wait until that attempt times out.
We should try to improve the robustness of the authorities. Fine, but easier said than done.
At the same time, we should explore letting clients time out more quickly during critical-path directory requests.
We should make sure not to screw it up for clients that are on genuinely slow and high-latency connections. The goal is to improve time-to-bootstrap for the average case while not harming it (much) for the already bad cases.
There seem to be two high-level ways to solve this issue:
1. Reduce the timeout for dirport connection establishment to be very small.
2. Ask k directory requests at once (perhaps until we start getting answers, then slow down).
Are both good ideas, assuming we could do them right? Is the first easier to do than the second?
Is some form of the second one already done? I swear I've seen my client do multiple tunneled dir conns in the Vidalia window and on the control port... If it is done, why does this bug still cause us problems? Is bootstrapping somehow different?
> There seem to be two high-level ways to solve this issue:
> 1. Reduce the timeout for dirport connection establishment to be very small.
> 2. Ask k directory requests at once (perhaps until we start getting answers, then slow down).
> Are both good ideas, assuming we could do them right? Is the first easier to do than the second?
Reducing the timeout would break Tor completely. See #1297 (moved) (comment 11 and below) for how I cleaned up one part of the mess you made the last time you had Tor's timeouts lowered.
> Is some form of the second one already done? I swear I've seen my client do multiple tunneled dir conns in the Vidalia window and on the control port... If it is done, why does this bug still cause us problems? Is bootstrapping somehow different?
Tor performs descriptor (and microdescriptor) fetches over many directory connections in parallel. (Too many -- arma may have opened a ticket for that.) As far as I know, it does not launch multiple attempts to fetch a consensus in parallel.
Hrmm.. In order to support parallel consensus downloads, we probably need HTTP 1.1 range requests, which don't seem to be supported by the dirauths (who only implement HTTP 1.0).
How about this then: Bootstrapping clients make T concurrent test connections to a random selection of the N dirauths and M directory mirrors from #572 (moved). When the first connection succeeds, they close the remaining T-1 test connections, download the consensus through it, and then close it. No timeout changes, and we solve the "k nodes are down" bootstrap problem.
We should also ensure that in all cases, some fixed fraction of the T test connections go to the M directory mirrors, because the whole reason this issue is important is that we want the ability to enter a "panic mode" if the dirauths are under a resource-exhaustion attack or other DoS condition (such as the dirauth crash in the description of #2664 (moved)).
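For concreteness, a minimal standalone sketch of the "race T test connections, keep the winner" idea -- plain POSIX sockets with TEST-NET placeholder addresses; nothing here is actual tor code:

```c
/* Minimal standalone sketch (not tor code): race T nonblocking connect()s
 * and keep the first that completes, closing the other T-1.  The addresses
 * are TEST-NET placeholders, not real dirauths or mirrors. */
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int
start_connect(const char *host, const char *port)
{
  struct addrinfo hints, *res = NULL;
  int fd = -1;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;
  if (getaddrinfo(host, port, &hints, &res) != 0 || !res)
    return -1;
  fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
  if (fd >= 0) {
    fcntl(fd, F_SETFL, O_NONBLOCK);
    if (connect(fd, res->ai_addr, res->ai_addrlen) < 0 &&
        errno != EINPROGRESS) {
      close(fd);
      fd = -1;
    }
  }
  freeaddrinfo(res);
  return fd;
}

int
main(void)
{
  const char *hosts[] = { "192.0.2.1", "192.0.2.2", "192.0.2.3" };
  enum { T = 3 };
  struct pollfd pfds[T];
  int i, winner = -1;

  for (i = 0; i < T; i++) {
    pfds[i].fd = start_connect(hosts[i], "80");
    pfds[i].events = POLLOUT;
    pfds[i].revents = 0;
  }
  /* Wait up to 10 seconds for the first connection to finish. */
  if (poll(pfds, T, 10000) > 0) {
    for (i = 0; i < T && winner < 0; i++) {
      int err = 0;
      socklen_t len = sizeof(err);
      if (pfds[i].fd >= 0 && (pfds[i].revents & POLLOUT) &&
          getsockopt(pfds[i].fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 &&
          err == 0)
        winner = i;
    }
  }
  /* Keep the winner (it would be handed to the consensus-fetch code);
   * close the remaining test connections. */
  for (i = 0; i < T; i++) {
    if (i != winner && pfds[i].fd >= 0)
      close(pfds[i].fd);
  }
  if (winner >= 0)
    printf("would fetch the consensus via %s\n", hosts[winner]);
  else
    printf("no test connection completed\n");
  return winner >= 0 ? 0 : 1;
}
```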
Does this need a proposal to clarify and debate, or is this plan good enough to begin work on?
I had been imagining a design where we launch a further consensus fetch request every 5 seconds if the current one(s) haven't established a connection; but we don't hang up on the earlier ones until a much larger timeout. That way in the good case we don't make any more connections than now, but we fail over much faster.
> I had been imagining a design where we launch a further consensus fetch request every 5 seconds if the current one(s) haven't established a connection; but we don't hang up on the earlier ones until a much larger timeout. That way in the good case we don't make any more connections than now, but we fail over much faster.
But in the bad case where something like the #2664 (moved) crash, a dirauth censorship event, or a #3023 (moved)-related circuit-based attack destroys dirauth reachability, new clients will still have to wait up to 5*N seconds before even attempting a dir mirror. Since for some reason we want to keep adding new dirauths, we'll soon be back up in the O(minutes) delay range with this plan in these circumstances. It might get even worse than that, if a large number of our #572 (moved) fallback dir mirrors end up unreachable.
According to http://petsymposium.org/2012/papers/hotpets12-1-usability.pdf, even 30 seconds of browser launch delay is enough time for 40% of the population to either give up or to assume magic has happened and Tor is already working somehow, even with progress bar activity...
okay. Send it to tor-dev when you think it's ready to be a proposal, like 001-process.txt says. One section I'd suggest you add would be load/performance analysis: how much could this hurt? Is there something else that doesn't hurt so much?
Kicking this into "Tor: unspecified," since it is probably not going to get done by 10 Dec, and it seems very likely to have the traits of a "large feature".
I remain excited for somebody to work on this one.
I think it seems clear that we're going to have to launch multiple connections -- that is, that we'll need to launch a second connection even while the first one hasn't finished yet. That's just a fact because of long tcp timeouts. (In particular, consider networks where the active "it didn't work" tcp packets don't make it back to the user.)
Given that, I think Mike's suggested approach in proposal 210 is a fine one.
I especially like that we're retaining a connection attempt to the directory authorities -- since they're the only ones that cause a warn-level log when your clock appears wrong. I believe, but somebody should check, that clients get this warn-level log simply when they finish the TLS connection to the authority, because of the netinfo cell, right? (Notifying users about wrong clocks is one of the sticking points for deploying the current implementation of FallbackDirs, until #2628 (moved) is done.)
One ambiguity in the proposal: say we've launched our three connections, one to an authority and two to fallbackdirs, and it's time to launch three more. Do we follow the same pattern, one to a new authority and two to new fallbackdirs? Or do we say we've got one that's to an authority, and that's good enough, and pick three new from the fallbackdirs? My guess is that the proposal suggests the former, but I think it's up for grabs which is wiser. Probably still the former.
As for implementation, I think we could trigger channel attempts for the three chosen relays, and then have a function that gets called whenever a channel finishes being established, and in that function see if we're in a situation where we want to launch a directory fetch. Should be pretty clean.
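For illustration, a rough sketch of that callback shape (the type and function names below are invented for the sketch, not the real channel API):

```c
/* Rough sketch of the hook shape (the names here are invented for the
 * sketch, not the real channel API). */
#include <stdbool.h>
#include <stdio.h>

typedef struct channel_t {
  const char *nickname;
  bool is_dir_candidate;  /* chosen as a potential consensus source */
} channel_t;

static bool have_usable_consensus = false;
static bool consensus_fetch_in_progress = false;

/* Hypothetical hook, called whenever a channel finishes being established:
 * decide whether this is the moment to launch the directory fetch. */
static void
channel_established_cb(channel_t *chan)
{
  if (!chan->is_dir_candidate)
    return;
  if (have_usable_consensus || consensus_fetch_in_progress)
    return;  /* another channel already won the race */
  consensus_fetch_in_progress = true;
  printf("launching consensus fetch via %s\n", chan->nickname);
  /* ... a directory_initiate_command_*() call would go here ... */
}

int
main(void)
{
  channel_t a = { "fallback1", true };
  channel_t b = { "authority1", true };
  channel_established_cb(&a);  /* first channel up: fetch starts here */
  channel_established_cb(&b);  /* later channels: nothing extra happens */
  return 0;
}
```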
> I remain excited for somebody to work on this one.
> I think it seems clear that we're going to have to launch multiple connections -- that is, that we'll need to launch a second connection even while the first one hasn't finished yet. That's just a fact because of long tcp timeouts. (In particular, consider networks where the active "it didn't work" tcp packets don't make it back to the user.)
> Given that, I think Mike's suggested approach in proposal 210 is a fine one.
Agreed. I can imagine small tweaks to the proposal, where instead of launching 3 connections at once, we launch one connection immediately, then another connection if the first hasn't gotten an answer in a second or two, then a third connection a second or two after that.
> I especially like that we're retaining a connection attempt to the directory authorities -- since they're the only ones that cause a warn-level log when your clock appears wrong. I believe, but somebody should check, that clients get this warn-level log simply when they finish the TLS connection to the authority, because of the netinfo cell, right? (Notifying users about wrong clocks is one of the sticking points for deploying the current implementation of FallbackDirs, until #2628 (moved) is done.)
I believe that's all correct.
> One ambiguity in the proposal: say we've launched our three connections, one to an authority and two to fallbackdirs, and it's time to launch three more. Do we follow the same pattern, one to a new authority and two to new fallbackdirs? Or do we say we've got one that's to an authority, and that's good enough, and pick three new from the fallbackdirs? My guess is that the proposal suggests the former, but I think it's up for grabs which is wiser. Probably still the former.
Hmm. I kinda like the latter, but I don't have strong feelings there. It would be nice to make an argument either way.
> As for implementation, I think we could trigger channel attempts for the three chosen relays, and then have a function that gets called whenever a channel finishes being established, and in that function see if we're in a situation where we want to launch a directory fetch. Should be pretty clean.
> I remain excited for somebody to work on this one.
> I think it seems clear that we're going to have to launch multiple connections -- that is, that we'll need to launch a second connection even while the first one hasn't finished yet. That's just a fact because of long tcp timeouts. (In particular, consider networks where the active "it didn't work" tcp packets don't make it back to the user.)
> Given that, I think Mike's suggested approach in proposal 210 is a fine one.
> Agreed. I can imagine small tweaks to the proposal, where instead of launching 3 connections at once, we launch one connection immediately, then another connection if the first hasn't gotten an answer in a second or two, then a third connection a second or two after that.
I've revised proposal #210 (moved) to implement this exponential backoff scheme, while still maintaining the same number of connections within the first second.
Please see my branch bootstrap-exponential-backoff in https://github.com/teor2345/torspec.git
> I especially like that we're retaining a connection attempt to the directory authorities -- since they're the only ones that cause a warn-level log when your clock appears wrong. I believe, but somebody should check, that clients get this warn-level log simply when they finish the TLS connection to the authority, because of the netinfo cell, right? (Notifying users about wrong clocks is one of the sticking points for deploying the current implementation of FallbackDirs, until #2628 (moved) is done.)
> I believe that's all correct.
I've specified a retry schedule in which we make 1 authority connection and 2 directory mirror connections in the first second. This will:
* place a similar connection load (TLS) on the authorities
* reduce download load (data) on the authorities if the mirror (or first few mirrors) are faster
* continue to warn the user if their clock is wrong, by making sure we try to complete at least one authority TLS connection every time we launch (then close it if we are already downloading the directory)
* close all other connections as soon as a consensus download starts (see the sketch just after this list)
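A rough sketch of that pruning rule (the struct and function names are placeholders, not the real connection code):

```c
/* Sketch of the pruning rule (placeholder types, not the real connection
 * structures). */
#include <stdbool.h>
#include <stdio.h>

typedef struct dir_conn_t {
  int id;
  bool is_authority;           /* connection to a dirauth (clock check) */
  bool tls_done;               /* handshake (and NETINFO clock check) done? */
  bool downloading_consensus;  /* actively receiving the document */
  bool marked_for_close;
} dir_conn_t;

/* Once any connection is downloading the consensus, close the others --
 * except an authority connection whose TLS handshake hasn't finished yet,
 * which we keep just long enough to get the clock-skew warning. */
static void
prune_bootstrap_conns(dir_conn_t *conns, int n)
{
  bool downloading = false;
  int i;
  for (i = 0; i < n; i++)
    if (conns[i].downloading_consensus)
      downloading = true;
  if (!downloading)
    return;
  for (i = 0; i < n; i++) {
    if (conns[i].downloading_consensus)
      continue;
    if (conns[i].is_authority && !conns[i].tls_done)
      continue;
    conns[i].marked_for_close = true;
  }
}

int
main(void)
{
  dir_conn_t conns[] = {
    { 1, true,  false, false, false },  /* authority, still handshaking */
    { 2, false, true,  true,  false },  /* mirror, consensus coming in */
    { 3, false, true,  false, false },  /* mirror, idle */
  };
  int i;
  prune_bootstrap_conns(conns, 3);
  for (i = 0; i < 3; i++)
    printf("conn %d: %s\n", conns[i].id,
           conns[i].marked_for_close ? "close" : "keep");
  return 0;
}
```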
> One ambiguity in the proposal: say we've launched our three connections, one to an authority and two to fallbackdirs, and it's time to launch three more. Do we follow the same pattern, one to a new authority and two to new fallbackdirs? Or do we say we've got one that's to an authority, and that's good enough, and pick three new from the fallbackdirs? My guess is that the proposal suggests the former, but I think it's up for grabs which is wiser. Probably still the former.
> Hmm. I kinda like the latter, but I don't have strong feelings there. It would be nice to make an argument either way.
In the updated scheme, we connect to mirrors at a faster rate than authorities, approximately 5 mirror connections for every authority connection. (And even fewer if the connections succeed.)
> As for implementation, I think we could trigger channel attempts for the three chosen relays, and then have a function that gets called whenever a channel finishes being established, and in that function see if we're in a situation where we want to launch a directory fetch. Should be pretty clean.
Sounds good, I'll try to make it work along with #15775 (moved).
I've also added a section on trying both IPv4 and IPv6 addresses. I think this will allow us to implement IPv6 bootstrap along with #8374 (moved).
Starting to review v10, since that's what I see...
(Please do not do a rebase that changes stuff in these commits; instead, make fixup! or squash! commits so that I can review the changes and won't have to re-review what I've already been over.)
I am not sure about the must/should distinction at all. We'd like to avoid loops, sure, but I kinda feel like it isn't really true that a fallbackdir must never ask another fallback dir for information. Must think more here.
maybe a router_digest_is_fallback_dir function would be handy to have.
607fdd0a52f4ee41fe0cc27a102db53681a6aff4 (Refactor connection_get_* to produce lists and counts)
There is a huge amount of duplicate code here. Having the same 12-line function written more than once is just asking for trouble IMNSHO.
strcmp_opt() could simplify two of these functions.
In connection.h, you're calling free(conns), not smartlist_free(conns).
88821fe930d5275eadec3125e097674124477b30 (Add want_authority to directory_get_from_dirserver)
Are the only options really DL_WANT_AUTHORITY and DL_WANT_FALLBACK ? Shouldn't there be a third state for "any cache would be okay" ? Or am I not understanding right? (In that case, please consider making the documentation more idiot-friendly)
> Starting to review v10, since that's what I see...
> (Please do not do a rebase that changes stuff in these commits; instead, make fixup! or squash! commits so that I can review the changes and won't have to re-review what I've already been over.)
I have added commits to feature4483-v10. (Please ignore feature4483-v11.)
> I am not sure about the must/should distinction at all. We'd like to avoid loops, sure, but I kinda feel like it isn't really true that a fallbackdir must never ask another fallback dir for information. Must think more here.
I'm not convinced it's needed either, I think the current behaviour is acceptable. (See my detailed comments in #17574 (moved).)
We might want to explore it if we decide to make directory mirrors fetch from fallbacks (#17709 (moved)).
Reverted in 69faddc, which also reverts instances of should_fetch in subsequent commits.
Reverted in a22cfec, as we're modifying this message as part of the refactoring in #17739 (moved).
> maybe a router_digest_is_fallback_dir function would be handy to have.
Yes, we'll definitely need this at some point.
Added in 76c5b53.
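Something along these lines, sketched standalone (the fallback list here is a placeholder; the real list would come from the FallbackDir configuration, and the real code would use tor's digest-comparison helpers rather than memcmp):

```c
/* Standalone sketch: check a relay identity digest against a (placeholder)
 * fallback directory list. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define DIGEST_LEN 20

/* Placeholder digests; the real list would come from the FallbackDir
 * configuration / built-in defaults. */
static const char fallback_digests[][DIGEST_LEN] = {
  { 0x01, 0x02, 0x03, 0x04 /* ... remaining bytes of a real digest ... */ },
};
#define N_FALLBACKS \
  ((int)(sizeof(fallback_digests)/sizeof(fallback_digests[0])))

/* Return true iff <b>digest</b> (DIGEST_LEN bytes) matches a configured
 * fallback directory mirror. */
static bool
router_digest_is_fallback_dir(const char *digest)
{
  int i;
  if (!digest)
    return false;
  for (i = 0; i < N_FALLBACKS; i++) {
    /* the real code would use tor's memeq helpers here */
    if (!memcmp(digest, fallback_digests[i], DIGEST_LEN))
      return true;
  }
  return false;
}

int
main(void)
{
  char digest[DIGEST_LEN] = {0};
  printf("%d\n", router_digest_is_fallback_dir(digest)); /* 0: no match */
  return 0;
}
```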
> 607fdd0a52f4ee41fe0cc27a102db53681a6aff4 (Refactor connection_get_* to produce lists and counts)
See fixup commit f9f03c6 for all the fixes on this commit.
> There is a huge amount of duplicate code here. Having the same 12-line function written more than once is just asking for trouble IMNSHO.
I've removed the duplicate code in the get* functions, by using template macros.
This adds an extra check to connection_get_by_global_id, but doesn't change the behaviour of tor, because its only caller also makes this check.
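Roughly the shape of the template idea, as a standalone toy (the macro and matcher names are invented, not the ones actually used in connection.c):

```c
/* Toy illustration of generating the lookup functions from one template,
 * so the matching loop is written exactly once. */
#include <stdio.h>

typedef struct connection_t {
  int type;
  int purpose;
} connection_t;

static connection_t pool[] = {
  { 1, 10 }, { 2, 10 }, { 1, 20 },
};
#define N_CONNS ((int)(sizeof(pool)/sizeof(pool[0])))

/* One template expands into a "count matches" and a "get first match"
 * pair for whatever matcher it is given. */
#define CONN_GET_TEMPLATE(name, matches)                  \
  static int                                              \
  connection_count_##name(int type, int purpose)          \
  {                                                       \
    int i, n = 0;                                         \
    for (i = 0; i < N_CONNS; i++)                         \
      if (matches(&pool[i], type, purpose))               \
        n++;                                              \
    return n;                                             \
  }                                                       \
  static connection_t *                                   \
  connection_get_##name(int type, int purpose)            \
  {                                                       \
    int i;                                                \
    for (i = 0; i < N_CONNS; i++)                         \
      if (matches(&pool[i], type, purpose))               \
        return &pool[i];                                  \
    return NULL;                                          \
  }

#define MATCH_TYPE_PURPOSE(c, t, p) ((c)->type == (t) && (c)->purpose == (p))
CONN_GET_TEMPLATE(by_type_purpose, MATCH_TYPE_PURPOSE)

int
main(void)
{
  connection_t *c = connection_get_by_type_purpose(1, 20);
  printf("%d conns of type 1/purpose 10; type 1/purpose 20 is at index %d\n",
         connection_count_by_type_purpose(1, 10),
         c ? (int)(c - pool) : -1);
  return 0;
}
```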
> strcmp_opt() could simplify two of these functions.
I've modified those functions to use strcmp_opt.
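For anyone following along, this is roughly what the NULL-tolerant compare buys; the local strcmp_opt() below is only an approximation of the one in tor's util code:

```c
/* Local approximation of a NULL-tolerant strcmp; see tor's util code for
 * the real strcmp_opt(). */
#include <stdio.h>
#include <string.h>

static int
strcmp_opt(const char *s1, const char *s2)
{
  if (!s1)
    return s2 ? -1 : 0;
  else if (!s2)
    return 1;
  return strcmp(s1, s2);
}

int
main(void)
{
  /* Without it, every caller repeats the same NULL checks before comparing,
   * e.g. when matching an optional address field against a connection. */
  const char *wanted = NULL;
  const char *have = "127.0.0.1";
  printf("match: %d\n", strcmp_opt(wanted, have) == 0);
  return 0;
}
```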
> In connection.h, you're calling free(conns), not smartlist_free(conns).
I've removed the duplicate code in the header, by using a template macro.
This also fixes the free() issue.
Oh, and the unit tests still pass, so I didn't break anything in these transformations.
> 88821fe930d5275eadec3125e097674124477b30 (Add want_authority to directory_get_from_dirserver)
> Are the only options really DL_WANT_AUTHORITY and DL_WANT_FALLBACK ? Shouldn't there be a third state for "any cache would be okay" ? Or am I not understanding right? (In that case, please consider making the documentation more idiot-friendly)
Ugh, I was so stuck in the bootstrap world I baked it into the name!
See my fixup 8b5085a, where I change all DL_WANT_FALLBACK to DL_WANT_ANY_DIRSERVER, and add appropriate comments. In particular, while I could have changed launch_dummy_descriptor_download_as_needed() to use an authority to get a more trusted source for its IP address, I don't like the idea of some relays contacting an authority every 20 minutes. (So I left the previous behaviour intact.)
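For reference, the distinction now reads roughly like this (a sketch with paraphrased comments, not the actual or.h text):

```c
/* Sketch (paraphrased comments, not the actual or.h text). */
typedef enum {
  DL_WANT_ANY_DIRSERVER, /* any directory server will do: an authority,
                          * a directory cache, or a fallback */
  DL_WANT_AUTHORITY,     /* only a directory authority will do, e.g. when
                          * we need an answer we can treat as authoritative */
} download_want_authority_t;
```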
download_status_schedule_helper dereferences dls before it does tor_assert(dls); coverity won't like that, I think.
download_status_schedule_helper seems like a bad name. It doesn't have a verb. Maybe it should be ..._get_interval?
Somewhere, we need to explain how a dls actually works -- what it means, how to interpret it. I don't think the documentation in or.h really explains. Honestly, I'm not sure how this commit works. I think that an expanded comment on download_schedule_t type might be in order.
good to see tests though.
3244f51e18832cd507957a589e89452bd98164b3 (Multiple simultaneous connections for clients)
N_CONSENSUS_BOOTSTRAP_SOURCES concerns me. What if we think (someday) that we should try three fallbacks and an authority? Or that we should try an authority only after three fallbacks have failed? How much code would we need to discard?
> download_status_schedule_helper dereferences dls before it does tor_assert(dls); coverity won't like that, I think.
Oops, but by the power of C99, I can swap those lines.
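That is, the fix is just a reordering along these lines (the function body below is a stand-in, not the real helper):

```c
/* Assert on the pointer before the first dereference; the body here is a
 * stand-in for the real delay computation. */
#include <assert.h>

#define tor_assert(x) assert(x)  /* stand-in for tor's tor_assert() */

typedef struct download_status_t {
  int n_download_attempts;
} download_status_t;

static int
download_status_get_delay_sketch(const download_status_t *dls)
{
  tor_assert(dls);                  /* check the pointer first ... */
  return dls->n_download_attempts;  /* ... then dereference it */
}
```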
> download_status_schedule_helper seems like a bad name. It doesn't have a verb. Maybe it should be ..._get_interval?
It's actually getting a delay, so I have modified the name of the function and the variables in the function to reflect this.
> Somewhere, we need to explain how a dls actually works -- what it means, how to interpret it. I don't think the documentation in or.h really explains. Honestly, I'm not sure how this commit works. I think that an expanded comment on download_schedule_t type might be in order.
I have updated the comments on:
* download_schedule_t
* download_want_authority_t
* download_schedule_increment_t
* download_status_t
* download_status_increment_failure
I hope this explains how the code works better; let me know if there's anything that could be clearer.
I also made sure n_download_attempts is incremented for all schedules. It doesn't make sense to only track it for attempt-based schedules.
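A compressed sketch of how those pieces fit together (field and enum names follow this ticket, but this is not the real or.h definition):

```c
/* Compressed sketch, not the real or.h: what a download_status_t records
 * and how the schedules differ. */
#include <stdint.h>
#include <time.h>

typedef enum {
  DL_SCHED_GENERIC = 0,    /* ordinary downloads: back off as tries fail */
  DL_SCHED_CONSENSUS = 1,  /* client bootstrap: follow a per-attempt schedule */
} download_schedule_t;

typedef enum {
  DL_SCHED_INCREMENT_FAILURE, /* step along the schedule when a try fails */
  DL_SCHED_INCREMENT_ATTEMPT, /* step along the schedule every time we try */
} download_schedule_increment_t;

typedef struct download_status_t {
  time_t next_attempt_at;       /* do not retry this download before then */
  uint8_t n_download_failures;  /* how many tries have failed so far */
  uint8_t n_download_attempts;  /* how many tries have been launched so far */
  download_schedule_t schedule;
  download_schedule_increment_t increment_on;
  /* (the real struct also carries the download_want_authority_t choice
   *  sketched earlier in this ticket) */
} download_status_t;
```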
> 3244f51e18832cd507957a589e89452bd98164b3 (Multiple simultaneous connections for clients)
See fixup cc6b228 for the change for this commit.
> N_CONSENSUS_BOOTSTRAP_SOURCES concerns me. What if we think (someday) that we should try three fallbacks and an authority? Or that we should try an authority only after three fallbacks have failed? How much code would we need to discard?
TL;DR: We currently try 3 fallbacks and 2 authorities, if we need to.
Sorry for naming N_CONSENSUS_BOOTSTRAP_SOURCES badly; it's actually the number of different schedules used to download a document, not the number of sources.
(Modified the name to N_CONSENSUS_BOOTSTRAP_SCHEDULES.)
I think that this commit is unclear because the previous commit was poorly documented.
The schedules determine how many authorities and fallbacks are attempted, and in which order.
This means that:
* A fallback is tried immediately.
* If a connection hasn't been successful after 1 second, another fallback is tried.
* If a connection hasn't been successful after 5 seconds (1+4), another fallback is tried.
* If a connection hasn't been successful after 10 seconds, an authority is tried.
* If a connection hasn't been successful after 16 seconds (1+4+11), another fallback is tried.
* If a connection hasn't been successful after 21 seconds (10+11), another authority is tried.
(We can tune the schedules to manage the tradeoff between fallback load, authority load, and bootstrap success rate.)
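The timeline above falls out of two per-schedule delay lists; this standalone snippet just reproduces that arithmetic (the array names are mine, not the torrc or consensus-parameter names):

```c
/* Reproduces the timeline in the list above from two delay lists. */
#include <stdio.h>

int
main(void)
{
  const int mirror_delays[]    = { 0, 1, 4, 11 };  /* seconds between tries */
  const int authority_delays[] = { 10, 11 };
  int i, t;

  t = 0;
  printf("fallback mirror attempts at:");
  for (i = 0; i < (int)(sizeof(mirror_delays)/sizeof(int)); i++) {
    t += mirror_delays[i];
    printf(" %ds", t);            /* 0s 1s 5s 16s */
  }
  t = 0;
  printf("\nauthority attempts at:");
  for (i = 0; i < (int)(sizeof(authority_delays)/sizeof(int)); i++) {
    t += authority_delays[i];
    printf(" %ds", t);            /* 10s 21s */
  }
  printf("\n");
  return 0;
}
```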
> 86ac0928577c18a9e27c6c4aae24abf9478ffcfa -- still have to review this but...
> Is return; really the right thing to do in these directory_initiate_command_rend cases? Why?
"return" has the effect of cancelling the current connection attempt, which is what we want to do if the connection was an excess connection.
It's what the function already does when the connection fails (connection_connect() returns -1 or !linked_conn), and in all the other error cases.
However, it doesn't make sense to make a connection, then immediately close it. So I created connection_dir_avoid_extra_connection_for_purpose() and call it at the top of the function. I updated the unit tests (including fixing a bug in the unit tests) and updated some comments as well.
> I also made sure n_download_attempts is incremented for all schedules. It doesn't make sense to only track it for attempt-based schedules.
But I didn't update the unit tests for that. And, while I was updating the unit tests, I discovered I was incrementing it past IMPOSSIBLE_TO_DOWNLOAD. Which is bad.
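The shape of the fix is a saturating increment, something like the following (the helper name and the 255 value are illustrative):

```c
/* Saturating increment: never step past the sentinel value. */
#include <stdint.h>

#define IMPOSSIBLE_TO_DOWNLOAD 255  /* value here is illustrative */

static uint8_t
increment_attempts_capped(uint8_t n_download_attempts)
{
  if (n_download_attempts < IMPOSSIBLE_TO_DOWNLOAD)
    return (uint8_t)(n_download_attempts + 1);
  return n_download_attempts;
}
```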
Okay, I think I get it. Thanks for all the replies and comments; this patch is looking good.
Last questions:
How have you tested this in the wild? How well does it do? How sure are you that it doesn't open a lot of extra connections, and that it actually behaves as you think it does?
Could you do one last rebase to squash the fixup! commits and remove the reverted commits? I could try to do that, but I got conflicts when I did, and I'm worried about getting it wrong.