Opened 8 years ago

Closed 3 years ago

#4483 closed defect (implemented)

If k of n authorities are down, k/n bootstrapping clients are delayed for minutes

Reported by: arma
Owned by: teor
Priority: High
Milestone: Tor: 0.2.8.x-final
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: performance, bootstrap, dirauth-dos-resistance, tor-client, large-feature, prop210, 028-triage, TorCoreTeam201512, MikePerry201512R
Cc:
Actual Points:
Parent ID: #2664
Points: medium/large
Reviewer:
Sponsor: SponsorU

Description

When an authority is down, a client bootstrapping into the network might pick it to learn the consensus, and then have to wait until that attempt times out.

We should try to improve the robustness of the authorities. Fine, but easier said than done.

At the same time, we should explore letting clients time out more quickly during critical-path directory requests.

We should make sure not to screw it up for clients that are on genuinely slow and high-latency connections. The goal is to improve time-to-bootstrap for the average case while not harming it (much) for the already bad cases.

Child Tickets

Ticket    Status    Owner    Summary                                           Component
#17716    closed             Do clients need to do authority clock checks?     Core Tor/Tor

Attachments (1)

Fallback Calculations.pdf (16.0 KB) - added by teor 4 years ago.
Full fallback reliability calculations, ask teor for editable version


Change History (57)

comment:1 Changed 8 years ago by arma

Keywords: performance, bootstrap → performance bootstrap

comment:2 Changed 7 years ago by nickm

Milestone: Tor: 0.2.3.x-final → Tor: 0.2.4.x-final

not gonna happen in 0.2.3. Maybe backportable if there turns out to be a good answer in 0.2.4.

comment:3 Changed 7 years ago by mikeperry

Keywords: dos-resistence added
Parent ID: #2664

There seem to be two high-level ways to solve this issue:

  1. Reduce the timeout for dirport connection establishment to be very small.
  2. Ask k directory requests at once (perhaps until we start getting answers, then slow down).

Are both good ideas, assuming we could do them right? Can we do the first more easily than the second?

Is some form of the second one already done? I swear I've seen my client do multiple tunneled dir conns in the Vidalia window and on the control port... If it is done, why does this bug still cause us problems? Is bootstrapping somehow different?

comment:4 in reply to:  3 Changed 7 years ago by rransom

Replying to mikeperry:

There seem to be two high-level ways to solve this issue:

  1. Reduce the timeout for dirport connection establishment to be very small.
  2. Ask k directory requests at once (perhaps until we start getting answers, then slow down).

Are both good ideas, assuming we could do them right? Can we do the first more easily than the second?

Reducing the timeout would break Tor completely. See #1297 (comment 11 and below) for how I cleaned up one part of the mess you made the last time you had Tor's timeouts lowered.

Is some form of the second one already done? I swear I've seen my client do multiple tunneled dir conns in the Vidalia window and on the control port... If it is done, why does this bug still cause us problems? Is bootstrapping somehow different?

Tor performs descriptor (and microdescriptor) fetches over many directory connections in parallel. (Too many -- arma may have opened a ticket for that.) As far as I know, it does not launch multiple attempts to fetch a consensus in parallel.

comment:5 Changed 7 years ago by mikeperry

Hrmm.. In order to support parallel consensus downloads, we probably need HTTP 1.1 range requests, which don't seem to be supported by the dirauths (who only implement HTTP 1.0).

How about this then: Bootstrapping clients make T concurrent test connections to a random selection of the N dirauths and M directory mirrors from #572. When the first connection succeeds, they close the remaining T-1 test connections, download the consensus through it, and then close it. No timeout changes, and we solve the "k nodes are down" bootstrap problem.

We should also ensure that in all cases, some fixed fraction of the T test connections go to the M directory mirrors, because the whole reason this issue is important is that we want the ability to enter a "panic mode" if the dirauths are under a resource exhaustion attack or other DoS condition (such as the dirauth crash in the description of #2664).
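A rough sketch of the "first connection wins" race described above; every name here (conn_t, launch_consensus_conn(), and so on) is a hypothetical stand-in rather than Tor's actual connection API:

  /* Illustrative sketch only: race T probe connections and keep the first
   * one that completes.  All types and helpers are hypothetical stand-ins. */
  #include <stddef.h>

  #define N_PROBES 3

  typedef struct conn_t conn_t;
  conn_t *launch_consensus_conn(int server_idx);   /* dirauth or mirror */
  void close_conn(conn_t *conn);
  void begin_consensus_download(conn_t *conn);

  static conn_t *probes[N_PROBES];

  static void
  launch_probes(void)
  {
    for (int i = 0; i < N_PROBES; i++)
      probes[i] = launch_consensus_conn(i);
  }

  /* Called when any probe finishes its TCP/TLS handshake. */
  static void
  probe_connected_cb(conn_t *winner)
  {
    for (int i = 0; i < N_PROBES; i++) {
      if (probes[i] && probes[i] != winner) {
        close_conn(probes[i]);        /* hang up on the T-1 losing probes */
        probes[i] = NULL;
      }
    }
    begin_consensus_download(winner); /* fetch the consensus over the winner */
  }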

Does this need a proposal to clarify and debate, or is this plan good enough to begin work on?

comment:6 Changed 7 years ago by nickm

Keywords: dos-resistance needs-proposal added; dos-resistence removed

Definitely needs a proposal.

comment:7 Changed 7 years ago by arma

I had been imagining a design where we launch a further consensus fetch request every 5 seconds if the current one(s) haven't established a connection; but we don't hang up on the earlier ones until a much larger timeout. That way in the good case we don't make any more connections than now, but we failover much faster.
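A sketch of that idea, assuming a hypothetical once-per-second housekeeping hook; none of these names are Tor's real functions:

  /* Illustrative sketch only: launch a further fetch every LAUNCH_INTERVAL
   * seconds until some connection is established, but only hang up on older
   * attempts after the much larger HANGUP_TIMEOUT. */
  #include <time.h>

  #define LAUNCH_INTERVAL 5
  #define HANGUP_TIMEOUT 60

  void launch_consensus_fetch(void);              /* hypothetical helpers */
  int any_fetch_conn_established(void);
  void close_fetch_conns_older_than(int seconds);

  /* Assume this is called once per second from the main loop. */
  static void
  consensus_fetch_housekeeping(time_t now, time_t first_launch)
  {
    int elapsed = (int)(now - first_launch);

    if (!any_fetch_conn_established() && elapsed > 0 &&
        elapsed % LAUNCH_INTERVAL == 0)
      launch_consensus_fetch();                    /* fail over to another server */

    close_fetch_conns_older_than(HANGUP_TIMEOUT);  /* give up only much later */
  }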

comment:8 in reply to:  7 Changed 7 years ago by mikeperry

Replying to arma:

I had been imagining a design where we launch a further consensus fetch request every 5 seconds if the current one(s) haven't established a connection; but we don't hang up on the earlier ones until a much larger timeout. That way in the good case we don't make any more connections than now, but we failover much faster.

But in the bad case where something like the #2664 crash, a dirauth censorship event, or a #3023-related circuit-based attack destroys dirauth reachability, new clients will still have to wait up to 5*N seconds before even attempting a dir mirror. Since for some reason we want to keep adding new dirauths, we'll soon be back up in the O(minutes) delay range with this plan in these circumstances. It might get even worse than that, if a large number of our #572 fallback dir mirrors end up unreachable.

According to http://petsymposium.org/2012/papers/hotpets12-1-usability.pdf, even 30 seconds of browser launch delay is enough time for 40% of the population to either give up or assume magic has happened and Tor is already working somehow, even *with* progress bar activity...

comment:9 Changed 7 years ago by mikeperry

Keywords: dirauth-dos-resistance added; dos-resistance removed

comment:10 Changed 7 years ago by mikeperry

Keywords: MikePerry201210d added

comment:11 Changed 7 years ago by nickm

Keywords: tor-client added

comment:12 Changed 7 years ago by nickm

Component: Tor Client → Tor

comment:13 Changed 7 years ago by mikeperry

I pushed a draft proposal for this to my torspec.git remote mikeperry/consensus-bootstrap.

comment:14 Changed 7 years ago by nickm

okay. Send it to tor-dev when you think it's ready to be a proposal, like 001-process.txt says. One section I'd suggest you add would be load/performance analysis: how much could this hurt? Is there something else that doesn't hurt so much?

comment:15 Changed 7 years ago by mikeperry

Pushed some updates to https://gitweb.torproject.org/user/mikeperry/torspec.git/blob/consensus-bootstrap:/proposals/xxx-faster-headless-consensus-bootstrap.txt.

I added a performance analysis section and tweaked some parameters. I'll email it in another day or two.


comment:16 Changed 7 years ago by mikeperry

Keywords: large-feature added; MikePerry201210d removed

Nick wants to ponder this some more. Tagging with large-feature so we remember to revisit it soon.

comment:17 Changed 6 years ago by nickm

Keywords: prop210 added

For reference, this is proposal 210.

comment:18 Changed 6 years ago by nickm

Keywords: needs-proposal removed
Milestone: Tor: 0.2.4.x-final → Tor: unspecified

comment:19 Changed 6 years ago by nickm

Kicking this into "Tor: unspecified," since it is probably not going to get done by 10 Dec, and it seems very likely to have the traits of a "large feature".

comment:20 Changed 4 years ago by nickm

Milestone: Tor: unspecified → Tor: 0.2.8.x-final

Should at least look at this in 028.

comment:21 Changed 4 years ago by arma

I remain excited for somebody to work on this one.

I think it seems clear that we're going to have to launch multiple connections -- that is, that we'll need to launch a second connection even while the first one hasn't finished yet. That's just a fact because of long tcp timeouts. (In particular, consider networks where the active "it didn't work" tcp packets don't make it back to the user.)

Given that, I think Mike's suggested approach in proposal 210 is a fine one.

I especially like that we're retaining a connection attempt to the directory authorities -- since they're the only ones that cause a warn-level log when your clock appears wrong. I believe, but somebody should check, that clients get this warn-level log simply when they finish the TLS connection to the authority, because of the netinfo cell, right? (Notifying users about wrong clocks is one of the sticking points on deploying the current implementation of FallbackDirs, until #2628 is done.)

One ambiguity in the proposal: say we've launched our three connections, one to an authority and two to fallbackdirs, and it's time to launch three more. Do we do the same patterns, one to a new authority and two to new fallbackdirs? Or do we say we've got one that's to an authority, and that's good enough, and pick three new from the fallbackdirs? My guess is that the proposal suggests the former, but I think it's up for grabs which is wiser. Probably still the former.

As for implementation, I think we could trigger channel attempts for the three chosen relays, and then have a function that gets called whenever a channel finishes being established, and in that function see if we're in a situation where we want to launch a directory fetch. Should be pretty clean.
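A sketch of that shape, with hypothetical names standing in for Tor's channel machinery:

  /* Illustrative sketch only: open channels to the chosen relays, then decide
   * in the "channel established" callback whether to launch a directory fetch. */
  typedef struct channel_t channel_t;
  void open_channel_to(int relay_idx);          /* hypothetical helpers */
  int still_need_consensus(void);
  void launch_dir_fetch_on(channel_t *chan);

  static void
  start_bootstrap_channels(void)
  {
    for (int i = 0; i < 3; i++)                 /* e.g. one authority, two fallbacks */
      open_channel_to(i);
  }

  /* Called whenever a channel finishes being established. */
  static void
  channel_established_cb(channel_t *chan)
  {
    if (still_need_consensus())
      launch_dir_fetch_on(chan);
  }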

comment:22 in reply to:  21 ; Changed 4 years ago by nickm

Replying to arma:

I remain excited for somebody to work on this one.

I think it seems clear that we're going to have to launch multiple connections -- that is, that we'll need to launch a second connection even while the first one hasn't finished yet. That's just a fact because of long tcp timeouts. (In particular, consider networks where the active "it didn't work" tcp packets don't make it back to the user.)

Given that, I think Mike's suggested approach in proposal 210 is a fine one.

Agreed. I can imagine small tweaks to the proposal, where instead of launching 3 connections at once, we launch one connection immediately, then another connection if the first hasn't gotten an answer in a second or two, then a third connection a second or two after that.

I especially like that we're retaining a connection attempt to the directory authorities -- since they're the only ones that cause a warn-level log when your clock appears wrong. I believe, but somebody should check, that clients get this warn-level log simply when they finish the TLS connection to the authority, because of the netinfo cell, right? (Notifying users about wrong clocks is one of the sticking points on deploying the current implementation of FallbackDirs, until #2628 is done.)

I believe that's all correct.

One ambiguity in the proposal: say we've launched our three connections, one to an authority and two to fallbackdirs, and it's time to launch three more. Do we do the same patterns, one to a new authority and two to new fallbackdirs? Or do we say we've got one that's to an authority, and that's good enough, and pick three new from the fallbackdirs? My guess is that the proposal suggests the former, but I think it's up for grabs which is wiser. Probably still the former.

Hmm. I kinda like the latter, but I don't have strong feelings there. It would be nice to make an argument either way.

As for implementation, I think we could trigger channel attempts for the three chosen relays, and then have a function that gets called whenever a channel finishes being established, and in that function see if we're in a situation where we want to launch a directory fetch. Should be pretty clean.

comment:23 Changed 4 years ago by mikeperry

Keywords: 028-triage added

comment:24 Changed 4 years ago by nickm

Keywords: performance bootstrap dirauth-dos-resistance tor-client large-feature prop210 028-triage → performance, bootstrap, dirauth-dos-resistance, tor-client, large-feature, prop210, 028-triage

comment:25 Changed 4 years ago by nickm

Points: medium/large

comment:26 Changed 4 years ago by nickm

Priority: normal → major

comment:27 in reply to:  22 Changed 4 years ago by teor

Created #17217 to deal with the Client*IPv6* options after IPv6 bootstrap works.

comment:28 Changed 4 years ago by teor

Replying to nickm:

Replying to arma:

I remain excited for somebody to work on this one.

I think it seems clear that we're going to have to launch multiple connections -- that is, that we'll need to launch a second connection even while the first one hasn't finished yet. That's just a fact because of long tcp timeouts. (In particular, consider networks where the active "it didn't work" tcp packets don't make it back to the user.)

Given that, I think Mike's suggested approach in proposal 210 is a fine one.

Agreed. I can imagine small tweaks to the proposal, where instead of launching 3 connections at once, we launch one connection immediately, then another connection if the first hasn't gotten an answer in a second or two, then a third connection a second or two after that.

I've revised proposal #210 to implement this exponential backoff scheme, while still maintaining the same number of connections within the first second.
Please see my branch bootstrap-exponential-backoff in https://github.com/teor2345/torspec.git

See my email to tor-dev for the full proposal text.
https://lists.torproject.org/pipermail/tor-dev/2015-October/009622.html

I especially like that we're retaining a connection attempt to the directory authorities -- since they're the only ones that cause a warn-level log when your clock appears wrong. I believe, but somebody should check, that clients get this warn-level log simply when they finish the TLS connection to the authority, because of the netinfo cell, right? (Notifying users about wrong clocks is one of the sticking points on deploying the current implementation of FallbackDirs, until #2628 is done.)

I believe that's all correct.

I've specified a retry schedule in which we make 1 authority connection and 2 directory mirror connections in the first second. This will:

  • place a similar connection load (TLS) on the authorities
  • reduce download load (data) on the authorities if the mirror (or first few mirrors) are faster
  • continue to warn the user if their clock is wrong by making sure we try to complete at least one authority TLS connection every time we launch (then close it if we are already downloading the directory)
  • close all other connections as soon as a consensus download starts

One ambiguity in the proposal: say we've launched our three connections, one to an authority and two to fallbackdirs, and it's time to launch three more. Do we do the same patterns, one to a new authority and two to new fallbackdirs? Or do we say we've got one that's to an authority, and that's good enough, and pick three new from the fallbackdirs? My guess is that the proposal suggests the former, but I think it's up for grabs which is wiser. Probably still the former.

Hmm. I kinda like the latter, but I don't have strong feelings there. It would be nice to make an argument either way.

In the updated scheme, we connect to mirrors at a faster rate than authorities, approximately 5 mirror connections for every authority connection. (And even fewer if the connections succeed.)


As for implementation, I think we could trigger channel attempts for the three chosen relays, and then have a function that gets called whenever a channel finishes being established, and in that function see if we're in a situation where we want to launch a directory fetch. Should be pretty clean.

Sounds good, I'll try to make it work along with #15775.

I've also added a section on trying both IPv4 and IPv6 addresses. I think this will allow us to implement IPv6 bootstrap along with #8374.

Changed 4 years ago by teor

Attachment: Fallback Calculations.pdf added

Full fallback reliability calculations, ask teor for editable version

comment:29 Changed 4 years ago by teor

Owner: set to teor
Severity: Normal
Status: new → accepted

I'm working on this in feature4483-v7 at https://github.com/teor2345/tor.git

I've finished editing the proposal (#17235) and the code works, but I'm still writing unit tests.

comment:30 Changed 4 years ago by teor

Status: accepted → needs_revision

The proposal has been merged to the master branch of torspec as proposals/210-faster-headless-consensus-bootstrap.txt

I have a working implementation for IPv4 that requires:

The implementation is missing:

  • some unit tests,
  • fine-tuning of the connection schedules,
  • thorough testing.

comment:31 Changed 4 years ago by teor

Keywords: TorCoreTeam201511 added

comment:32 Changed 3 years ago by teor

Keywords: TorCoreTeam201512 added; TorCoreTeam201511 removed

This is likely to be completed in December

comment:33 Changed 3 years ago by teor

Status: needs_revision → needs_review

This change is ready for review as branch feature4483-v10 on https://github.com/teor2345/tor.git

It passes all the chutney "make test-network-all" checks.

The branch includes commits for:

  • Prevent fallback directory mirror loops (#17574)
  • Adjust clock skew message for fallback directories (#17716)

I'll be testing it extensively over the next few days in the following roles:

  • client
  • hidden service
  • bridge

(Any Tor instance that thinks it should fetch from the authorities doesn't run this multiple-simultaneous-connection code.)

comment:34 Changed 3 years ago by teor

My initial testing shows that this patch works in the following roles on the live network:

  • client
  • hidden service
  • bridge
  • bridge client

The code for relays etc. remains the same.

comment:35 Changed 3 years ago by teor

Sponsor: SponsorU

This falls under SponsorU's DoS-resistance area.

comment:36 Changed 3 years ago by teor

We're doing the clock skew changes in #17739, I removed the corresponding changes from the branch, and closed #17716 as a duplicate.

Please see feature4483-v11, which is rebased on the latest master.

comment:37 Changed 3 years ago by mikeperry

Keywords: MikePerry201512R added

comment:38 Changed 3 years ago by nickm

Please see feature4483-v11, which is rebased on the latest master.

Did you forget to push this one?

comment:39 Changed 3 years ago by nickm

Starting to review v10, since that's what I see...

(Please do not do a rebase that changes stuff in these commits; instead, make fixup! or squash! commits so that I can review the changes and won't have to re-review what I've already been over.)

452272b4b7c03b1504fb8d04054473b275fa18e2 (#17574) --

  • I am not sure about the must/should distinction at all. We'd like to avoid loops, sure, but I kinda feel like it isn't really true that a fallbackdir must never ask another fallback dir for information. Must think more here.

f4397294e75dc4a419e8dfca95f1cf4e745e842a (#17716) --

  • maybe a router_digest_is_fallback_dir function would be handy to have.

607fdd0a52f4ee41fe0cc27a102db53681a6aff4 (Refactor connection_get_* to produce lists and counts)

  • There is a huge amount of duplicate code here. Having the same 12-line function written more than once is just asking for trouble IMNSHO.
  • strcmp_opt() could simplify two of these functions.
  • In connection.h, you're calling free(conns), not smartlist_free(conns).

88821fe930d5275eadec3125e097674124477b30 (Add want_authority to directory_get_from_dirserver)

  • Are the only options really DL_WANT_AUTHORITY and DL_WANT_FALLBACK ? Shouldn't there be a third state for "any cache would be okay" ? Or am I not understanding right? (In that case, please consider making the documentation more idiot-friendly)

bc8f21f7bd4ed3ea52da94f44c58f34919841307 (Add attempt-based connection schedules)

  • .... (more to follow)...

comment:40 in reply to:  39 Changed 3 years ago by teor

Replying to nickm:

Starting to review v10, since that's what I see...

(Please do not do a rebase that changes stuff in these commits; instead, make fixup! or squash! commits so that I can review the changes and won't have to re-review what I've already been over.)

I have added commits to feature4483-v10. (Please ignore feature4483-v11.)

452272b4b7c03b1504fb8d04054473b275fa18e2 (#17574) --

  • I am not sure about the must/should distinction at all. We'd like to avoid loops, sure, but I kinda feel like it isn't really true that a fallbackdir must never ask another fallback dir for information. Must think more here.

I'm not convinced it's needed either, I think the current behaviour is acceptable. (See my detailed comments in #17574.)
We might want to explore it if we decide to make directory mirrors fetch from fallbacks (#17709).

Reverted in 69faddc, which also reverts instances of should_fetch in subsequent commits.

f4397294e75dc4a419e8dfca95f1cf4e745e842a (#17716) --

Reverted in a22cfec, as we're modifying this message as part of the refactoring in #17739.

  • maybe a router_digest_is_fallback_dir function would be handy to have.

Yes, we'll definitely need this at some point.
Added in 76c5b53.

607fdd0a52f4ee41fe0cc27a102db53681a6aff4 (Refactor connection_get_* to produce lists and counts)

See fixup commit f9f03c6 for all the fixes on this commit.

  • There is a huge amount of duplicate code here. Having the same 12-line function written more than once is just asking for trouble IMNSHO.

I've removed the duplicate code in the get* functions, by using template macros.

This adds an extra check to connection_get_by_global_id, but doesn't change the behaviour of tor, because its only caller also makes this check.

  • strcmp_opt() could simplify two of these functions.

I've modified those functions to use strcmp_opt.

  • In connection.h, you're calling free(conns), not smartlist_free(conns).

I've removed the duplicate code in the header, by using a template macro.
This also fixes the free() issue.

Oh, and the unit tests still pass, so I didn't break anything in these transformations.
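For readers unfamiliar with the template-macro deduplication described above, here is a minimal, self-contained illustration of the pattern; it is not the actual connection.c code, just the general trick of generating near-identical "find first" and "count" helpers from one macro body:

  /* Illustrative sketch only: one macro body replaces several hand-copied
   * loops.  conn_t and the generated helpers are hypothetical. */
  #include <stddef.h>

  typedef struct conn_t { int type; int purpose; struct conn_t *next; } conn_t;

  #define CONN_GET_TEMPLATE(name, rettype, init, on_match, on_end)   \
    static rettype                                                   \
    name(const conn_t *head, int type, int purpose)                  \
    {                                                                \
      rettype result = init;                                         \
      for (const conn_t *c = head; c; c = c->next) {                 \
        if (c->type == type && c->purpose == purpose) {              \
          on_match;                                                  \
        }                                                            \
      }                                                              \
      on_end;                                                        \
    }

  /* Returns the first matching connection, or NULL. */
  CONN_GET_TEMPLATE(conn_get_first, const conn_t *, NULL,
                    return c, return result)
  /* Returns the number of matching connections. */
  CONN_GET_TEMPLATE(conn_count, int, 0,
                    result++, return result)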

88821fe930d5275eadec3125e097674124477b30 (Add want_authority to directory_get_from_dirserver)

  • Are the only options really DL_WANT_AUTHORITY and DL_WANT_FALLBACK ? Shouldn't there be a third state for "any cache would be okay" ? Or am I not understanding right? (In that case, please consider making the documentation more idiot-friendly)

Ugh, I was so stuck in the bootstrap world I baked it into the name!

See my fixup 8b5085a, where I change all DL_WANT_FALLBACK to DL_WANT_ANY_DIRSERVER, and add appropriate comments. In particular, while I could have changed launch_dummy_descriptor_download_as_needed() to use an authority to get a more trusted source for its IP address, I don't like the idea of some relays contacting an authority every 20 minutes. (So I left the previous behaviour intact.)

bc8f21f7bd4ed3ea52da94f44c58f34919841307 (Add attempt-based connection schedules)

  • .... (more to follow)...

I am ready to receive further instructions...

comment:41 Changed 3 years ago by nickm

bc8f21f7bd4ed3ea52da94f44c58f34919841307 (Add attempt-based connection schedules)

  • download_status_schedule_helper dereferences dls before it does tor_assert(dls); coverity won't like that, I think.
  • download_status_schedule_helper seems like a bad name. It doesn't have a verb. Maybe it should be ..._get_interval?
  • Somewhere, we need to explain how a dls actually works -- what it means, how to interpret it. I don't think the documentation in or.h really explains. Honestly, I'm not sure how this commit works. I think that an expanded comment on download_schedule_t type might be in order.
    • good to see tests though.

3244f51e18832cd507957a589e89452bd98164b3 (Multiple simultaneous connections for clients)

  • N_CONSENSUS_BOOTSTRAP_SOURCES concerns me. What if we think (someday) that we should try three fallbacks and an authority? Or that we should try an authority only after three fallbacks have failed? How much code would we need to discard?

(sleeping time)

comment:42 Changed 3 years ago by nickm

86ac0928577c18a9e27c6c4aae24abf9478ffcfa -- still have to review this but...

  • Is return; really the right thing to do in these directory_initiate_command_rend cases? Why?

comment:43 in reply to:  41 ; Changed 3 years ago by teor

Replying to nickm:

bc8f21f7bd4ed3ea52da94f44c58f34919841307 (Add attempt-based connection schedules)

All fixups for this commit are in 8e50ae7.

  • download_status_schedule_helper dereferences dls before it does tor_assert(dls); coverity won't like that, I think.

Oops, but by the power of C99, I can swap those lines.

  • download_status_schedule_helper seems like a bad name. It doesn't have a verb. Maybe it should be ..._get_interval?

It's actually getting a delay, so I have modified the name of the function and the variables in the function to reflect this.

  • Somewhere, we need to explain how a dls actually works -- what it means, how to interpret it. I don't think the documentation in or.h really explains. Honestly, I'm not sure how this commit works. I think that an expanded comment on download_schedule_t type might be in order.

I have updated the comments on:

  • download_schedule_t
  • download_want_authority_t
  • download_schedule_increment_t
  • download_status_t
  • download_status_increment_failure

I hope this explains how the code works better, let me know if there's anything that could be clearer.

I also made sure n_download_attempts is incremented for all schedules. It doesn't make sense to only track it for attempt-based schedules.
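As a rough paraphrase of the download-status idea described above (an illustration only, not the actual or.h definitions):

  /* Illustrative paraphrase, not the real or.h code: a download status
   * remembers how many times we have tried (and failed) to fetch an object,
   * and its schedule says how long to wait before the next try.  Attempt-based
   * schedules advance every time a request is launched; failure-based
   * schedules advance only when a request actually fails. */
  #include <time.h>

  typedef enum {
    EXAMPLE_INCREMENT_ON_FAILURE,  /* classic behaviour */
    EXAMPLE_INCREMENT_ON_ATTEMPT   /* used for bootstrap connection schedules */
  } example_increment_t;

  typedef struct {
    unsigned n_download_failures;  /* requests that came back with an error */
    unsigned n_download_attempts;  /* requests we launched at all */
    time_t next_attempt_at;        /* earliest time we may try again */
    example_increment_t increment_on;
  } example_download_status_t;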

3244f51e18832cd507957a589e89452bd98164b3 (Multiple simultaneous connections for clients)

See fixup cc6b228 for the change for this commit.

  • N_CONSENSUS_BOOTSTRAP_SOURCES concerns me. What if we think (someday) that we should try three fallbacks and an authority? Or that we should try an authority only after three fallbacks have failed? How much code would we need to discard?

TL;DR: We currently try 3 fallbacks and 2 authorities, if we need to.

Sorry for naming N_CONSENSUS_BOOTSTRAP_SOURCES badly, it's actually the number of different schedules used to download a document, not the number of sources.
(Modified the name to N_CONSENSUS_BOOTSTRAP_SCHEDULES.)

I think that this commit is unclear because the previous commit was poorly documented.

The schedules determine how many authorities and fallbacks are attempted, and in which order.

For example, the current default schedules are:

TestingClientBootstrapConsensusAuthorityDownloadSchedule 10, 11, ...
TestingClientBootstrapConsensusFallbackDownloadSchedule  0, 1, 4, 11, ...

This means that:

  • A fallback is tried immediately.
  • If a connection hasn't been successful after 1 second, another fallback is tried.
  • If a connection hasn't been successful after 5 seconds (1+4), another fallback is tried.
  • If a connection hasn't been successful after 10 seconds, an authority is tried.
  • If a connection hasn't been successful after 16 seconds (1+4+11), another fallback is tried.
  • If a connection hasn't been successful after 21 seconds (10+11), another authority is tried.

(We can tune the schedules to manage the tradeoff between fallback load, authority load, and bootstrap success rate.)
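In other words, the schedule entries are deltas and the attempt times are their running sums. A tiny standalone illustration of that arithmetic (not tor's actual schedule-parsing code):

  /* Illustrative sketch only: schedule entries are delays between launches,
   * so absolute attempt times are the running (prefix) sums of the entries. */
  #include <stdio.h>

  static void
  print_attempt_times(const char *label, const int *delays, int n)
  {
    int t = 0;
    for (int i = 0; i < n; i++) {
      t += delays[i];
      printf("%s attempt %d at %d s\n", label, i + 1, t);
    }
  }

  int
  main(void)
  {
    const int authority[] = { 10, 11 };       /* attempts at 10 s and 21 s */
    const int fallback[]  = { 0, 1, 4, 11 };  /* attempts at 0, 1, 5, 16 s */
    print_attempt_times("authority", authority, 2);
    print_attempt_times("fallback", fallback, 4);
    return 0;
  }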

comment:44 in reply to:  42 Changed 3 years ago by teor

Replying to nickm:

86ac0928577c18a9e27c6c4aae24abf9478ffcfa -- still have to review this but...

  • Is return; really the right thing to do in these directory_initiate_command_rend cases? Why?

"return" has the effect of cancelling the current connection attempt, which is what we want to do if the connection was an excess connection.

It's what the function already does when the connection fails (connection_connect() returns -1 or !linked_conn), and in all the other error cases.

However, it doesn't make sense to make a connection, then immediately close it. So I created connection_dir_avoid_extra_connection_for_purpose() and call it at the top of the function. I updated the unit tests (including fixing a bug in the unit tests) and updated some comments as well.
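A sketch of that early-return shape; connection_dir_avoid_extra_connection_for_purpose() is the helper name from this comment, but its signature and the surrounding code here are assumptions, not the real implementation:

  /* Illustrative sketch only: bail out at the top of the request-initiating
   * function when an equivalent connection is already in flight.  The
   * helper's signature is assumed, not copied from the real code. */
  int connection_dir_avoid_extra_connection_for_purpose(int purpose);

  static void
  directory_initiate_request_sketch(int dir_purpose)
  {
    if (connection_dir_avoid_extra_connection_for_purpose(dir_purpose))
      return;  /* cancel this attempt; it would be an excess connection */

    /* ... otherwise connect and launch the directory request as before ... */
  }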

See my commit 087ae73 for these fixups.

comment:45 in reply to:  43 Changed 3 years ago by teor

Replying to teor:

I also made sure n_download_attempts is incremented for all schedules. It doesn't make sense to only track it for attempt-based schedules.

But I didn't update the unit tests for that. And, while I was updating the unit tests, I discovered I was incrementing it past IMPOSSIBLE_TO_DOWNLOAD. Which is bad.

Fixed in 6415d37.

comment:46 Changed 3 years ago by nickm

Okay, I think I get it. Thanks for all the replies and comments; this patch is looking good.

Last questions:

  • How have you tested this in the wild? How well does it do? How sure are you that it doesn't open a lot of extra connections, and that it actually behaves as you think it does?
  • Could you do one last rebase to squash the fixup! commits and remove the reverted commits? I could try to do that, but I got conflicts when I did, and I'm worried about getting it wrong.

comment:47 in reply to:  46 Changed 3 years ago by teor

Replying to nickm:

Okay, I think I get it. Thanks for all the replies and comments; this patch is looking good.

Last questions:

  • How have you tested this in the wild? How well does it do? How sure are you that it doesn't open a lot of extra connections, and that it actually behaves as you think it does?

Improvements

I'm very confident that the code works as expected, and provides significantly better bootstrap reliability for clients. (And it gets even better when we can use fallback directories as well - see #15775.)

I've tested it as a client, hidden service, and bridge client, using chutney and in the wild, on multiple OSs (OS X, Linux), with and without fallback directories. It bootstraps correctly in all these cases.

It opens connections exactly on schedule (regardless of whether the other connections fail or are delayed), and it stops when it gets a consensus.

I've been running it as my Tor Browser tor for a week or two.
(The code is only active for clients before they download a consensus. This limits its impact to client bootstrap.)

Safety

The current schedules limit tor to trying:

  • If there are no fallbacks: 3 authority connections over 10 seconds
    • I feel that adding one authority connection is the minimum we could do to make bootstrapping more reliable, and it's a good tradeoff between reliability and load.
  • If there are fallbacks: 4 fallbacks and 2 authorities over 21 seconds
    • We don't add any extra authorities in this case; we expect 4 fallbacks to provide much more reliability than 1 authority.

Then each schedule waits for an hour before trying again.

There is code in the patch that limits tor to 3 simultaneous connections.

The existing code is used to limit the number of connection attempts/failures to 4 (authorities only) or 7 (fallbacks and authorities). Tor then waits an hour before trying again (the existing "slow zombie" behaviour). It will also retry if there's a SOCKSPort request.

  • Could you do one last rebase to squash the fixup! commits and remove the reverted commits? I could try to do that, but I got conflicts when I did, and I'm worried about getting it wrong.

The squashed commit is at feature4483-v10-squashed.

I'm sorry about that, some of my commits have dependencies on one another. And one of the fixups was mislabeled. I'm trying to get better at this.
(There was also a conflict with the clock skew changes that were merged.)

I made one further fix:

  • I changed the name of CONSENSUS_BOOTSTRAP_SOURCE_FALLBACK to CONSENSUS_BOOTSTRAP_SOURCE_ANY_DIRSERVER to match the change to DL_WANT_ANY_DIRSERVER.

(If you want to check that the squash didn't change anything (except the name change above), a plain rebase with no squashes is at feature4483-v10-rebased. It's not easy to do this with feature4483-v10, as it was based on master pre-0.2.7.6 release.)

comment:48 Changed 3 years ago by nickm

Resolution: implemented
Status: needs_review → closed

Okay, sounds pretty solid to me. Merged the "squashed" branch. Thanks!

comment:49 Changed 3 years ago by cypherpunks

Resolution: implemented
Status: closed → reopened

Sorry to reopen but one of the tests introduced by this ticket is failing on Jenkins.

config/use_multiple_directories: 
  FAIL ../src/test/test_config.c:3489: assert(networkstatus_consensus_can_use_multiple_directories(options) == 0)
  [use_multiple_directories FAILED]

comment:50 in reply to:  49 Changed 3 years ago by teor

Replying to cypherpunks:

Sorry to reopen but one of the tests introduced by this ticket is failing on Jenkins.

config/use_multiple_directories: 
  FAIL ../src/test/test_config.c:3489: assert(networkstatus_consensus_can_use_multiple_directories(options) == 0)
  [use_multiple_directories FAILED]

I saw that error, but I can't replicate it on OS X using master.

Which OS(es) does this affect?

comment:51 Changed 3 years ago by teor

I can't replicate this on:

  • Linux x86_64
  • OS X x86_64 or i386

I'm guessing it's either:

  • a transient issue that's affected by the local machine and/or previous tests,
  • an issue that's since been fixed in master, or
  • an issue that only occurs on BSD, Windows, or a rare Linux configuration.

Can you let me know which platform(s) are failing this test?

I suspect this might be because that line depends on the result of router_pick_published_address(). One way to fix this is to set DirPort_set in the previous line, instead of ORPort_set.

comment:52 in reply to:  51 Changed 3 years ago by cypherpunks

Replying to teor:

Can you let me know which platform(s) are failing this test?

I can't replicate it either locally. I just saw Jenkins having a bunch of red dots for Debian Jessie and FreeBSD, so I reopened to get it fixed.

I suspect this might be because that line depends on the result of router_pick_published_address().

I'm also suspecting this.

One way to fix this is to set DirPort_set in the previous line, instead of ORPort_set.

IMO if the expectation is that OR servers should fetch from authorities we should really test this instead of changing the test just so it succeeds. We should investigate further so we find the root cause of the problem and fix that instead.

comment:53 Changed 3 years ago by nickm

My guess (and this is just a guess) is that directory_fetches_from_authorities() is returning 0 because router_pick_published_address() is failing here.

comment:54 in reply to:  53 ; Changed 3 years ago by cypherpunks

Replying to nickm:

My guess (and this is just a guess) is that directory_fetches_from_authorities() is returning 0 because router_pick_published_address() is failing here.

Which means router_pick_published_address is actually succeeding when guessing our address while we're expecting it not to.

comment:55 in reply to:  54 Changed 3 years ago by teor

Replying to cypherpunks:

Replying to nickm:

My guess (and this is just a guess) is that directory_fetches_from_authorities() is returning 0 because router_pick_published_address() is failing here.

Which means router_pick_published_address is actually succeeding when guessing our address while we're expecting it not to.

Yes, I think I was getting a little worn out by all the unit tests at that point.

Please see my branch fix-multi-dir, which:

  • Fixes an issue where some OR servers would bootstrap using multiple connections (and some clients would not)
    • This happened because directory_fetches_from_authorities and public_server_mode give pretty similar results for common cases
  • Ensures the fallback check works
  • Expands the failing unit test to check that relays, exits, and directory servers all fetch directories the way we expect
    • This test now mocks the relevant functions so the results don't depend on the machine
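A hedged sketch of what "mocking the relevant functions" can look like in tor's unit tests; the MOCK()/UNMOCK() helpers exist in the test suite, but the router_pick_published_address() signature shown here is an assumption about the 0.2.8-era code:

  /* Illustrative sketch only: pin router_pick_published_address() to a fixed
   * result so the test no longer depends on the machine's real addresses.
   * The mocked function's signature is an assumption, not copied from tor. */
  static int
  mock_router_pick_published_address_fail(const or_options_t *options,
                                          uint32_t *addr)
  {
    (void)options;
    (void)addr;
    return -1;   /* pretend no publishable address could be found */
  }

  /* In the test body: */
  MOCK(router_pick_published_address, mock_router_pick_published_address_fail);
  /* ... check networkstatus_consensus_can_use_multiple_directories(options)
   *     for client, relay, exit, and directory-server configurations ... */
  UNMOCK(router_pick_published_address);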

comment:56 Changed 3 years ago by nickm

Resolution: implemented
Status: reopened → closed

Okay, merging that. Thanks!
