Opened 16 months ago

Last modified 3 months ago

#25061 assigned defect

Relays consider it a bootstrapping failure if they can't extend for somebody else's circuit

Reported by: arma Owned by: catalyst
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: backport-032, stability, 033-triage-20180320, 033-included-20180320, s8-errors, bootstrap, 034-deferred-20180602, 035-removed-20180711, 033-backport-unreached
Cc: catalyst Actual Points:
Parent ID: Points:
Reviewer: Sponsor: Sponsor19-can

Description

Say you have a relay that is up and listed as non-slow in the consensus. Due to the current overload (#24902), this relay is getting many many circuit requests per second. Due to bug #24767, we will make a huge number of connection attempts to other relays that are down, because as soon as we get a "connection refused", we will get another circuit request that triggers another connection attempt.

So when your relay restarts, since it's still in the consensus and clients still think it's usable, it will immediately get flooded with circuit requests, causing these connection attempts to resume.

And Tor calls every one of those connection attempts a bootstrapping attempt, even if there are no origin circuits related to that connection attempt.

Child Tickets

TicketStatusOwnerSummaryComponent
#24639closedAfter I updated my relay to Tor 0.3.2.7-rc I got a large amount of Warn errorsCore Tor/Tor

Change History (23)

comment:1 Changed 16 months ago by arma

Cc: catalyst added

cc'ing catalyst since they've been working on bootstrapping stuff and might be the perfect person for doing up (and/or reviewing) a patch

comment:2 Changed 16 months ago by arma

I noticed this bug on my freeBogatov relay, and then a relay operator reported it on tor-relays too:
https://lists.torproject.org/pipermail/tor-relays/2018-January/014316.html
so I figured that was the right time to file a ticket.

Jan 25 18:31:54.000 [warn] Problem bootstrapping. Stuck at 90%: Establishing a Tor circuit. (Connection timed out; TIMEOUT; count 1503; recommendation warn; host B7047FBDE9C53C39011CA84E5CB2A8E3543066D0 at 199.249.223.60:443)
Jan 25 18:31:54.000 [warn] 1503 connections have failed:
Jan 25 18:31:54.000 [warn]  1026 connections died in state connect()ing with SSL state (No SSL object)
Jan 25 18:31:54.000 [warn]  392 connections died in state handshaking (TLS) with SSL state SSLv2/v3 read server hello A in HANDSHAKE
Jan 25 18:31:54.000 [warn]  84 connections died in state handshaking (Tor, v3 handshake) with SSL state SSL negotiation finished successfully in OPEN
Jan 25 18:31:54.000 [warn]  1 connections died in state handshaking (TLS) with SSL state unknown state in HANDSHAKE

comment:3 Changed 16 months ago by dgoulet

Keywords: 033-must added

I'm "must-ing" this ticket because I think we should avoid printing those and should properly report the bootstrap failure.

comment:4 Changed 14 months ago by nickm

Keywords: stability added

comment:5 Changed 14 months ago by nickm

Keywords: 033-triage-20180320 added

Marking all tickets reached by current round of 033 triage.

comment:6 Changed 14 months ago by nickm

Keywords: 033-included-20180320 added

Mark 033-must tickets as triaged-in for 0.3.3

comment:7 Changed 14 months ago by catalyst

Owner: set to catalyst
Status: newassigned

comment:8 Changed 14 months ago by arma

The hint is that this warn happens through control_event_bootstrap_prob_or() which is in four places in connection_or.c. That call happens when an OR connection fails, and none of those four callsites check whether it's an origin circuit, that is, whether it's "our" circuit or somebody else's.

You'll notice that somewhere along the line we wrapped those calls with

   if (!authdir_mode_tests_reachability(options)

Commit d37fae2f is the commit from ancient history to skim, and commit 818332e7 is the more recent commit in this area that should give you some more context.

The fix might start with:

  • In connection_or.c, when we're considering whether to call control_event_bootstrap_prob_or, break that "if authdir_mode_tests_reachability" check out its into own function, called something like "could this connection be a bootstrap attempt", which checks if it's a reachability test, but also does more checks.

The trouble here is that in these "more checks", we want to check if there are any origin circuits pending on that connection attempt. And it looks like we don't actually know that here.

So the first option I thought of is that in that "could it be" function, we want to call circuit_get_all_pending_on_channel(), and then iterate over the resulting list to see if any of them are CIRCUIT_IS_ORIGIN(). That's certainly the most straightforward solution, in that it doesn't go mess with other code much. But it's kind of crummy, because we're walking another list (twice, in fact) every time there's a connection failure. It's not *so* bad though, because somewhere along the line somebody went to the trouble of making a separate list just for the circuits that are pending a channel attempt -- lucky us.

My second thought is: wait, we already *do* walk that list, later on, when we're trying to handle all of the circuits that were waiting for this channel to succeed or fail. That happens in circuit_n_chan_done(). So in the

      if (!status) { /* chan failed; close circ */

clause in that loop, we could check if it's an origin circuit, and if so set a flag that makes us do a bootstrap complaint if the flag was ever set to 1.

But! circuit_n_chan_done() gets called from connection_or_about_to_close(), which is way after we know (and have then forgotten) *why* the conn closed. So that's still not best.

Could we add a field to or_connection_t remembering why the connection failed, and then use it during circuit_n_chan_done() to report bootstrap issues accurately? Maybe -- let's call that approach one.

A third option is: how about we set a flag in or_connection_t, which is "has an origin circuit ever been interested in my outcome", and that flag starts off as 0, and it gets set to 1 when we're about to add a circuit to circuits_pending_chans on a circuit where CIRCUIT_IS_ORIGIN() is true.

But! We only add to circuits_pending_chans inside circuit_set_state(), and that function doesn't know which chan or conn we're planning to attach this circ to.

The better callsite would be
circuit_handle_first_hop(), which is entirely for origin circuits. There are two cases we need to handle here. One is when there isn't a good conn available, and we decide to launch one for this circuit. That's the easy case: if channel_connect_for_circuit() returns NULL, and we launch a new one and it worked, we set the "we wanted this channel for an origin circuit" flag on it right then. The other is if channel_get_for_extend returned NULL but didn't set should_launch, meaning there is a conn somewhere out there, but it's not ready yet -- and we can't easily get ahold of it from this function.

For that second case, check out channel_get_for_extend(). There are two places that call it: circuit_handle_first_hop(), which is for origin circuits, and circuit_extend(), which is for handling EXTEND cells so it is explicitly not for origin circuits. If we added a flag to that function, to say whether it's an origin circuit we're asking on behalf of, then *it* can mark conns that had origin circuits asking about them. The only case I can see in that function where we would need to mark the chan is when we're incrementing n_inprogress_goodaddr.

So, let's call that "add a flag to the channel and turn it on when an origin circuit inquired" idea approach two.

Last edited 14 months ago by arma (previous) (diff)

comment:9 Changed 14 months ago by arma

I pushed a bug25061-demo commit to my arma repo, which sketches out how approach two would look.

comment:10 Changed 14 months ago by catalyst

I'm still looking into how best to do this. I think I'm starting to wrap my brain around it. What you call approach one and approach two differ in several ways, one of which is I suspect approach two would tend to minimize textual changes to existing code (which might be a reasonable goal for a 0.3.3 patch). A downside of approach two is pushing more higher-level state and logic into lower-level code, so I think it's not the best solution in the long term.

The status quo is that lower-level code has logic to use higher-level state to decide whether to call control_event_bootstrap_prob_or(). I think we should restructure things so that higher-level code has the necessary information from lower levels to make this decision, which weighs in favor of approach one. Maybe that's best done as a separate ticket, but I would like to try to make the existing abstractions cleaner instead of messier.

comment:11 Changed 14 months ago by nickm

Catalyst: given your summary above, I agree that approach two is better for an 0.3.3 patch, especially if we're thinking of backporting it.

comment:12 in reply to:  8 Changed 14 months ago by catalyst

Replying to arma:

You'll notice that somewhere along the line we wrapped those calls with

   if (!authdir_mode_tests_reachability(options)

Commit d37fae2f is the commit from ancient history to skim, and commit 818332e7 is the more recent commit in this area that should give you some more context.

It looks like commit c6a94718cd was the one that introduced the silencing of connection problems for dirauths.

Also some of the calls to authdir_mode_tests_reachability() are indirect through connection_or_connect_failed().

comment:13 Changed 13 months ago by catalyst

Keywords: s8-errors bootstrap added
Sponsor: Sponsor8

comment:14 Changed 13 months ago by catalyst

Keywords: 033-backport added; 033-must removed
Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final

Bump to 0.3.4 per nickm.

comment:15 Changed 13 months ago by catalyst

There are four places that call control_event_bootstrap_prob_or():

  • connection_or_about_to_close() in connection_or.c
    • This seems to have a fair amount of code in the TLS handshake failure case, which could probably be factored out. It also calls channel_closed() before any of the other stuff, which maybe should be reordered?
    • It's only really called through connection_unlink() in main.c.
  • connection_or_connect_failed() in connection_or.c
    • This fairly simple function has multiple callers
  • connection_or_connect() in connection_or.c
    • This seems to only handle the case where a PT proxy is missing. Notably, it does NOT call authdir_mode_tests_reachability().
  • connection_or_client_learned_peer_id() in connection_or.c
    • This seems to be when a router cert has an unexpected ID.

I'm thinking of ways to move the logic that currently calls control_event_bootstrap_prob_or() into either the channel layer or even higher. I think having the callers of control_event_bootstrap_prob_or() call authdir_mode_tests_reachability() themselves is unnecessary when control_event_bootstrap_prob_or() itself could probably safely do so. (The one case above where the caller doesn't call authdir_mode_tests_reachability() is a pluggable transport situation, and dirauths don't use pluggable transports that I know of.)

comment:16 Changed 12 months ago by nickm

Keywords: 034-deferred-20180602 added
Milestone: Tor: 0.3.4.x-finalTor: 0.3.5.x-final

Deferring non-must tickets to 0.3.5

comment:17 Changed 11 months ago by nickm

Keywords: 035-removed-20180711 added
Milestone: Tor: 0.3.5.x-finalTor: unspecified

These tickets are being triaged out of 0.3.5. The ones marked "035-roadmap-proposed" may return.

comment:18 Changed 7 months ago by catalyst

We should double check whether this merely makes scary-sounding warnings, or if it actually prevents a relay from completing bootstrapping.

comment:19 Changed 7 months ago by arma

I believe it just produces scary-sounding warnings: the bootstrap complaint logs are entirely separate from Tor's decision to try to launch circuits and fetch things and etc.

comment:20 in reply to:  19 Changed 7 months ago by catalyst

Sponsor: Sponsor8Sponsor8-can

Replying to arma:

I believe it just produces scary-sounding warnings: the bootstrap complaint logs are entirely separate from Tor's decision to try to launch circuits and fetch things and etc.

Thanks. I'll put it in sponsor8-can, then. I think it's also lower priority than bootstrap reporting issues that cause client-facing UX problems.

comment:21 Changed 5 months ago by catalyst

Sponsor: Sponsor8-canSponsor19-can

comment:22 Changed 3 months ago by teor

Keywords: 033-backport removed

These open, non-merge_ready tickets can not get backported to 0.3.3, because 0.3.3 is now unsupported.

comment:23 Changed 3 months ago by teor

Keywords: 033-backport-unreached added

Hmm, I guess they should still get 033-backport-unreached

Note: See TracTickets for help on using tickets.