It seems to me that if just a moment ago you tried to connect to a relay and failed with reason connectrefused, then maybe you already know what the answer is going to be for other circuits that show up in that second.
How long should we cache the answer for? And, are there any anonymity implications to this design change? (We already batch circuit extend requests as they're waiting for the connect request to finish or fail, so I think whatever the anonymity implications are, maybe we already have them?)
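To make the idea concrete, here is a minimal sketch of remembering the last connect failure per relay identity and consulting it before launching another attempt. This is not Tor's actual code; the struct, function, and constant names are made up for illustration, and the lifetime is just a placeholder for whatever we decide above.

```c
#include <string.h>
#include <time.h>

#define DIGEST_LEN 20
#define FAILURE_CACHE_LIFETIME 60  /* seconds; the exact value is an open question */

/* Hypothetical record of the most recent failed connect attempt to a relay. */
typedef struct failed_connect_entry_t {
  char identity_digest[DIGEST_LEN];
  time_t last_failure;
} failed_connect_entry_t;

/* Return 1 if we already failed to connect to this relay within the last
 * FAILURE_CACHE_LIFETIME seconds, so the caller can skip a fresh attempt. */
static int
recently_failed_to_connect(const failed_connect_entry_t *entry,
                           const char *identity_digest, time_t now)
{
  if (!entry)
    return 0;
  if (memcmp(entry->identity_digest, identity_digest, DIGEST_LEN) != 0)
    return 0;
  return entry->last_failure + FAILURE_CACHE_LIFETIME > now;
}
```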
I gave this ticket a title to better reflect its urgency.
It's an important ticket because many fast relays are filling their connection tables (or equivalent) with a sustained flood of outgoing connection attempts.
Trac: Summary: "If you just failed to extend a circuit to a relay that's down, consider remembering that for a bit" to "All relays are constantly connecting to down relays and failing over and over"
When we fix this, we should delay connection attempts if:
- there is already a connection attempt in progress (#24841 (moved))
I think we already do this part correctly. See the NCIRCS=3 argument for the event -- that's the number of circuits that got queued pending on this connection attempt's outcome.
I can take this one. But I want to get this clear first. To summarize what we have here and what we need to decide:
Fortunately, we do track OR connection failures in rephist (see rep_hist_note_connect_failed()), so we could use the or_history_t to decide whether we should try to connect.
We'll need to decide on a "how long before I retry to reconnect" timing. We could do a simple backoff algorithm like 1 sec, then 5 sec, then 10 sec, and then give up (see the strawman sketch after these questions).
Once I've identified that tor can't reach a specific relay, how long should we consider it "unreachable"? Or should we repeat the above if no retries are pending?
Should we keep the EXTEND cells while waiting for a reasonable amount of time before reconnecting? That is, if we have a failure to connect, should we just DESTROY the circuit or retry to connect and send the cell(s) later?
Should we also count disconnects, and cases of "I've waited 60 sec without getting anything back from you", as failures worth noting down?
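As a strawman for the retry-timing question, here is one way such a simple backoff schedule could look. The helper name and the schedule values are illustrative only and are not taken from any branch.

```c
/* Strawman backoff schedule: wait 1s, then 5s, then 10s between reconnect
 * attempts, and give up after that.  Returns the number of seconds to wait
 * before the next attempt, or -1 to stop retrying. */
static int
connect_retry_delay(int n_failures)
{
  static const int schedule[] = { 1, 5, 10 };
  const int n_steps = (int)(sizeof(schedule) / sizeof(schedule[0]));
  if (n_failures <= 0)
    return 0;             /* no failures recorded yet: connect right away */
  if (n_failures > n_steps)
    return -1;            /* schedule exhausted: give up on this relay */
  return schedule[n_failures - 1];
}
```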
Trac: Keywords: N/A deleted; tor-relay, tor-dos, performance added. Status: new to accepted. Priority: Medium to Very High. Owner: N/A to dgoulet
No need for a complex backoff thing. Just remember the failure for 60 seconds, and during those 60 seconds, send back a destroy immediately. Reducing from n attempts per minute down to 1 per minute should be enough to help us survive until the clients get a consensus update and stop asking us to try.
We should avoid having this cached-failure thing impact client behavior. That is, it should cause other people's circuits to get destroys, but it shouldn't auto-fail our own client attempts. Maybe we should change how the client behaves, but if so, let's do it later, and not introduce subtle breakage in something we'll be considering for backport.
Hm! I was going to say "but rep_hist_note_connect_failed() won't work if the relay isn't in our consensus", but actually, it is simply based on intended identity digest of the next destination, so it does look like we can reuse the data struct. Except, shouldn't we also be caching the IP:port that failed? Otherwise somebody can ask us to extend to a victim digest at an unworkable IP:port, and we'll cache "failure!" and then refuse all the legit cells going to the right address for the next minute.
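Putting these points together, the cached entry would probably need to be keyed on the identity digest plus the IP:port that failed, and consulted only for relayed extend requests. Here is a rough sketch of that shape; the names are hypothetical and this is not the actual branch.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <time.h>

#define DIGEST_LEN 20
#define FAILURE_CACHE_LIFETIME 60  /* remember each failure for 60 seconds */

/* Hypothetical cache entry: the failure is remembered per
 * (identity digest, address, port), not per digest alone, so that an extend
 * request naming a victim's digest at a bogus address cannot poison the
 * entry for that relay's real address. */
typedef struct extend_failure_entry_t {
  char identity_digest[DIGEST_LEN];
  struct sockaddr_storage addr;   /* holds either an IPv4 or an IPv6 address */
  uint16_t port;
  time_t last_failure;
} extend_failure_entry_t;

/* Hypothetical check run when a relayed circuit asks us to extend: on a
 * recent cached failure, answer with a DESTROY right away instead of
 * launching yet another doomed connection.  Origin (client) circuits skip
 * the cache so it cannot change our own client behavior. */
static int
should_refuse_extend(const extend_failure_entry_t *entry,
                     int is_origin_circuit, time_t now)
{
  if (is_origin_circuit)
    return 0;
  if (!entry)
    return 0;  /* no recorded failure for this digest/address/port */
  return entry->last_failure + FAILURE_CACHE_LIFETIME > now;
}
```

Keying on the address as well as the digest also fits the IPv6 point below: a dual-stack relay's IPv4 and IPv6 attempts would be cached, and would expire, independently.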
> No need for a complex backoff thing. Just remember the failure for 60 seconds, and during those 60 seconds, send back a destroy immediately. Reducing from n attempts per minute down to 1 per minute should be enough to help us survive until the clients get a consensus update and stop asking us to try.
60 seconds seems sensible.
And it's about the timeframe it takes to restart a relay (tor takes 30s to shut down by default).
Anything less would be ineffective.
> We should avoid having this cached-failure thing impact client behavior. That is, it should cause other people's circuits to get destroys, but it shouldn't auto-fail our own client attempts. Maybe we should change how the client behaves, but if so, let's do it later, and not introduce subtle breakage in something we'll be considering for backport.
Excluding origin circuits seems fine.
(So does including them, but only on relays. But we already have backoff and limits on origin circuits.)
> Hm! I was going to say "but rep_hist_note_connect_failed() won't work if the relay isn't in our consensus", but actually, it is simply based on intended identity digest of the next destination, so it does look like we can reuse the data struct. Except, shouldn't we also be caching the IP:port that failed? Otherwise somebody can ask us to extend to a victim digest at an unworkable IP:port, and we'll cache "failure!" and then refuse all the legit cells going to the right address for the next minute.
When we introduce IPv6 extends from relays (#24404 (moved)), we only want to fail attempts to a single IP address, and not the whole relay. So dual-stack relays will get two attempts to other dual-stack relays: an IPv4 attempt and an IPv6 attempt. I like that design; it makes sense to try both.
Ok, after discussion on IRC with arma a bit this morning, I've ended up with a branch.
This is untested (although it compiles and chutney is happy); there are no unit tests and no proper "in the wild" testing on it yet.
I just want to throw it here as a full draft because 1) it adds quite a bit of code and 2) I want a review of the code logic before putting it into review mode.
(FWIW, this has also been a finding of dawuud months ago, where he scanned the Tor network for partitions and found that many high-bandwidth relays cannot contact other high-bandwidth relays. I guess he tried a single circuit build between those relays, so basically his findings are isomorphic to the ones from nusenu.)