It seems to me that if just a moment ago you tried to connect to a relay and failed with reason connectrefused, then maybe you already know what the answer is going to be for other circuits that show up in that second.
How long should we cache the answer for? And, are there any anonymity implications to this design change? (We already batch circuit extend requests as they're waiting for the connect request to finish or fail, so I think whatever the anonymity implications are, maybe we already have them?)
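To make the idea concrete, here is a minimal sketch of remembering the last connect failure per relay identity and consulting it before launching another attempt. This is not Tor's actual code; the struct, function, and constant names are made up for illustration, and the lifetime is just a placeholder for whatever we decide above.

```c
#include <string.h>
#include <time.h>

#define DIGEST_LEN 20
#define FAILURE_CACHE_LIFETIME 60  /* seconds; the exact value is an open question */

/* Hypothetical record of the most recent failed connect attempt to a relay. */
typedef struct failed_connect_entry_t {
  char identity_digest[DIGEST_LEN];
  time_t last_failure;
} failed_connect_entry_t;

/* Return 1 if we already failed to connect to this relay within the last
 * FAILURE_CACHE_LIFETIME seconds, so the caller can skip a fresh attempt. */
static int
recently_failed_to_connect(const failed_connect_entry_t *entry,
                           const char *identity_digest, time_t now)
{
  if (!entry)
    return 0;
  if (memcmp(entry->identity_digest, identity_digest, DIGEST_LEN) != 0)
    return 0;
  return entry->last_failure + FAILURE_CACHE_LIFETIME > now;
}
```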
I gave this ticket a title to better reflect its urgency.
It's an important ticket because many fast relays are filling their connection tables (or equivalent) with a sustained flood of outgoing connection attempts.
Trac: Summary: "If you just failed to extend a circuit to a relay that's down, consider remembering that for a bit" to "All relays are constantly connecting to down relays and failing over and over"
When we fix this, we should delay connection attempts if:
- there is already a connection attempt in progress (#24841 (moved))
I think we already do this part correctly. See the NCIRCS=3 argument for the event -- that's the number of circuits that got queued pending on this connection attempt's outcome.
I can take this one. But I want to get this clear first. To summarize what we have here and what we need to decide:
Fortunately, we do track OR connection failures in rephist (see rep_hist_note_connect_failed()), so we could use the or_history_t to decide whether we should try to connect.
We'll need to decide on a "how long before I retry to reconnect" timing. We could do a simple backoff algorithm like 1 sec, then 5 sec, then 10 sec, and then give up (see the strawman sketch after these questions).
Once I've identified that tor can't reach a specific relay, how long should we consider it "unreachable"? Or should we repeat the above if no retries are pending?
Should we keep the EXTEND cells while waiting for a reasonable amount of time before reconnecting? That is, if we have a failure to connect, should we just DESTROY the circuit or retry to connect and send the cell(s) later?
Should we also count disconnects, and cases of "I've waited 60 sec without getting anything back from you", as failures worth noting down?
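As a strawman for the retry-timing question, here is one way such a simple backoff schedule could look. The helper name and the schedule values are illustrative only and are not taken from any branch.

```c
/* Strawman backoff schedule: wait 1s, then 5s, then 10s between reconnect
 * attempts, and give up after that.  Returns the number of seconds to wait
 * before the next attempt, or -1 to stop retrying. */
static int
connect_retry_delay(int n_failures)
{
  static const int schedule[] = { 1, 5, 10 };
  const int n_steps = (int)(sizeof(schedule) / sizeof(schedule[0]));
  if (n_failures <= 0)
    return 0;             /* no failures recorded yet: connect right away */
  if (n_failures > n_steps)
    return -1;            /* schedule exhausted: give up on this relay */
  return schedule[n_failures - 1];
}
```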
Trac: Keywords: N/A deleted; tor-relay, tor-dos, performance added. Status: new to accepted. Priority: Medium to Very High. Owner: N/A to dgoulet
No need for a complex backoff thing. Just remember the failure for 60 seconds, and during those 60 seconds, send back a destroy immediately. Reducing from n attempts per minute down to 1 per minute should be enough to help us survive until the clients get a consensus update and stop asking us to try.
We should avoid having this cached-failure thing impact client behavior. That is, it should cause other people's circuits to get destroys, but it shouldn't auto-fail our own client attempts. Maybe we should change how the client behaves, but if so, let's do it later, and not introduce subtle breakage in something we'll be considering for backport.
Hm! I was going to say "but rep_hist_note_connect_failed() won't work if the relay isn't in our consensus", but actually, it is simply based on intended identity digest of the next destination, so it does look like we can reuse the data struct. Except, shouldn't we also be caching the IP:port that failed? Otherwise somebody can ask us to extend to a victim digest at an unworkable IP:port, and we'll cache "failure!" and then refuse all the legit cells going to the right address for the next minute.
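Putting these points together, the cached entry would probably need to be keyed on the identity digest plus the IP:port that failed, and consulted only for relayed extend requests. Here is a rough sketch of that shape; the names are hypothetical and this is not the actual branch.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <time.h>

#define DIGEST_LEN 20
#define FAILURE_CACHE_LIFETIME 60  /* remember each failure for 60 seconds */

/* Hypothetical cache entry: the failure is remembered per
 * (identity digest, address, port), not per digest alone, so that an extend
 * request naming a victim's digest at a bogus address cannot poison the
 * entry for that relay's real address. */
typedef struct extend_failure_entry_t {
  char identity_digest[DIGEST_LEN];
  struct sockaddr_storage addr;   /* holds either an IPv4 or an IPv6 address */
  uint16_t port;
  time_t last_failure;
} extend_failure_entry_t;

/* Hypothetical check run when a relayed circuit asks us to extend: on a
 * recent cached failure, answer with a DESTROY right away instead of
 * launching yet another doomed connection.  Origin (client) circuits skip
 * the cache so it cannot change our own client behavior. */
static int
should_refuse_extend(const extend_failure_entry_t *entry,
                     int is_origin_circuit, time_t now)
{
  if (is_origin_circuit)
    return 0;
  if (!entry)
    return 0;  /* no recorded failure for this digest/address/port */
  return entry->last_failure + FAILURE_CACHE_LIFETIME > now;
}
```

Keying on the address as well as the digest also fits the IPv6 point below: a dual-stack relay's IPv4 and IPv6 attempts would be cached, and would expire, independently.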
> No need for a complex backoff thing. Just remember the failure for 60 seconds, and during those 60 seconds, send back a destroy immediately. Reducing from n attempts per minute down to 1 per minute should be enough to help us survive until the clients get a consensus update and stop asking us to try.
60 seconds seems sensible.
And it's about the timeframe it takes to restart a relay (tor takes 30s to shut down by default).
Anything less would be ineffective.
> We should avoid having this cached-failure thing impact client behavior. That is, it should cause other people's circuits to get destroys, but it shouldn't auto-fail our own client attempts. Maybe we should change how the client behaves, but if so, let's do it later, and not introduce subtle breakage in something we'll be considering for backport.
Excluding origin circuits seems fine.
(So does including them, but only on relays. But we already have backoff and limits on origin circuits.)
> Hm! I was going to say "but rep_hist_note_connect_failed() won't work if the relay isn't in our consensus", but actually, it is simply based on intended identity digest of the next destination, so it does look like we can reuse the data struct. Except, shouldn't we also be caching the IP:port that failed? Otherwise somebody can ask us to extend to a victim digest at an unworkable IP:port, and we'll cache "failure!" and then refuse all the legit cells going to the right address for the next minute.
When we introduce IPv6 extends from relays (#24404 (moved)), we only want to fail attempts to a single IP address, and not the whole relay. So dual-stack relays will get two attempts to other dual-stack relays: an IPv4 attempt and an IPv6 attempt. I like that design; it makes sense to try both.
Ok, after discussion on IRC with arma a bit this morning, I've ended up with a branch.
This is untested (although it compiles and chutney is happy); there are no unit tests and no proper "in the wild" testing on it yet.
I just want to throw it here as a full draft because 1) it adds quite a bit of code and 2) I want a review of the code logic before putting it into review mode.
(FWIW, this has also been a finding of dawuud months ago, where he scanned the Tor network for partitions and found that many high-bandwidth relays cannot contact other high-bandwidth relays. I guess he tried a single circuit build between those relays, so basically his findings are isomorphic to the ones from nusenu.)