Opened 11 months ago

Closed 8 months ago

#24767 closed enhancement (fixed)

All relays are constantly connecting to down relays and failing over and over

Reported by: arma
Owned by: dgoulet
Priority: Very High
Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-relay, tor-dos, performance, review-group-32, 033-must, review-group-34, 033-triage-20180320, 033-included-20180320
Cc:
Actual Points:
Parent ID:
Points:
Reviewer: asn
Sponsor:

Description

Starlight at
https://lists.torproject.org/pipermail/tor-relays/2017-December/014001.html
points out that when a relay receives many extend cells asking it to extend circuits to a relay that's down, it will try over and over to make a new connection to that next relay:

650 ORCONN $5E2975B380F6E908AFB29E0D8D0B967AF73B41C2~WombleNode01 LAUNCHED ID=61580092
650 ORCONN $5E2975B380F6E908AFB29E0D8D0B967AF73B41C2~WombleNode01 FAILED REASON=CONNECTREFUSED NCIRCS=4 ID=61580092
650 ORCONN $5E2975B380F6E908AFB29E0D8D0B967AF73B41C2~WombleNode01 LAUNCHED ID=61580095
650 ORCONN $5E2975B380F6E908AFB29E0D8D0B967AF73B41C2~WombleNode01 FAILED REASON=CONNECTREFUSED NCIRCS=3 ID=61580095
650 ORCONN $5E2975B380F6E908AFB29E0D8D0B967AF73B41C2~WombleNode01 LAUNCHED ID=61580098
[...]

It seems to me that if just a moment ago you tried to connect to a relay and you failed for reason connectrefused, then maybe you already know what the answer is going to be for other circuits that show up in that second.

How long should we cache the answer for? And, are there any anonymity implications to this design change? (We already batch circuit extend requests as they're waiting for the connect request to finish or fail, so I think whatever the anonymity implications are, maybe we already have them?)

Child Tickets

Change History (44)

comment:1 Changed 11 months ago by teor

Whatever we do here, we should make it per-IP-family as part of #24405.

comment:2 Changed 10 months ago by nickm

Milestone: Tor: 0.3.3.x-final → Tor: 0.3.4.x-final

Deferring various "new"-status enhancement tickets to 0.3.4

comment:3 Changed 10 months ago by arma

Summary: If you just failed to extend a circuit to a relay that's down, consider remembering that for a bit → All relays are constantly connecting to down relays and failing over and over

I gave this ticket a new title to better reflect its urgency.

It's an important ticket because many fast relays are filling their conntables (or equivalent) with this sustained flood of outgoing connection attempts.

comment:4 Changed 10 months ago by teor

Keywords: must-fix-before-033-stable added
Milestone: Tor: 0.3.4.x-final → Tor: 0.3.3.x-final

comment:5 Changed 10 months ago by teor

When we fix this, we should delay connection attempts if:

  • there is already a connection attempt in progress (#24841), or
  • a connection attempt recently failed.

comment:6 in reply to:  5 Changed 10 months ago by arma

Replying to teor:

When we fix this, we should delay connection attempts if:

  • there is already a connection attempt in progress (#24841)

I think we already do this part correctly. See the NCIRCS=3 argument for the event -- that's the number of circuits that got queued pending on this connection attempt's outcome.

comment:7 Changed 10 months ago by dgoulet

Keywords: tor-relay tor-dos performance added
Owner: set to dgoulet
Priority: Medium → Very High
Status: new → accepted

I can take this one, but I want to get this clear. To summarize what we have here and what we need to decide:

  • Fortunately, we do track OR connection failures in rephist (see rep_hist_note_connect_failed()), so we could use the or_history_t to know whether we should try to connect.
  • We'll need to decide on a "how long before I retry to reconnect" delay. We could do a simple "backoff algorithm" like 1 sec, 5 sec, 10 sec, then give up.
  • Once I've identified that tor can't reach a specific relay, how long should we consider it "unreachable"? Or should we repeat the above if no retries are pending?
  • Should we keep the EXTEND cells while waiting for a reasonable amount of time before reconnecting? That is, if we have a failure to connect, should we just DESTROY the circuit or retry to connect and send the cell(s) later?
  • Should we also note down disconnects and "I've just waited 60 sec without getting anything back from you" as potential failures?

comment:8 Changed 10 months ago by arma

My thoughts for the first design would be:

  • No need for a complex backoff thing. Just remember the failure for 60 seconds, and during those 60 seconds, send back a destroy immediately. Reducing from n attempts per minute down to 1 per minute should be enough to help us survive until the clients get a consensus update and stop asking us to try.
  • We should avoid having this cached-failure thing impact client behavior. That is, it should cause *other people's* circuits to get destroys, but it shouldn't auto-fail our own client attempts. Maybe we should change how the client behaves, but if so, let's do it later, and not introduce subtle breakage in something we'll be considering for backport.
  • Hm! I was going to say "but rep_hist_note_connect_failed() won't work if the relay isn't in our consensus", but actually, it is simply based on intended identity digest of the next destination, so it does look like we can reuse the data struct. Except, shouldn't we also be caching the IP:port that failed? Otherwise somebody can ask us to extend to a victim digest at an unworkable IP:port, and we'll cache "failure!" and then refuse all the legit cells going to the right address for the next minute.
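
A standalone C sketch of the design in this comment: cache a failure keyed by identity digest plus the exact IP:port we tried, and answer new extend requests with an immediate destroy while the entry is younger than 60 seconds. The struct and function names are made up for illustration and the address is kept as a string to keep the sketch short; this is not the code from the branch.

  #include <stdint.h>
  #include <string.h>
  #include <time.h>

  #define DIGEST_LEN 20
  #define FAILURE_CACHE_LIFETIME 60  /* seconds */

  /* One cached failure: the digest we were asked to extend to, the exact
   * address we tried, and when the connect last failed. */
  typedef struct connect_failure_entry_t {
    uint8_t identity_digest[DIGEST_LEN];
    char addr_port[64];            /* e.g. "192.0.2.1:9001" */
    time_t last_failed_at;
  } connect_failure_entry_t;

  /* Return 1 if we recently failed to reach this digest at this address;
   * the caller should then send back a DESTROY right away instead of
   * launching yet another connection attempt. */
  int
  failure_is_cached(const connect_failure_entry_t *entry,
                    const uint8_t *digest, const char *addr_port,
                    time_t now)
  {
    if (memcmp(entry->identity_digest, digest, DIGEST_LEN) != 0)
      return 0;
    if (strcmp(entry->addr_port, addr_port) != 0)
      return 0;
    return (now - entry->last_failed_at) < FAILURE_CACHE_LIFETIME;
  }

Keying on the address as well as the digest is what blocks the attack in the last bullet: a failure recorded for a bogus IP:port does not poison extends to the relay's real address.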

comment:9 in reply to:  8 Changed 10 months ago by teor

Replying to arma:

My thoughts for the first design would be:

  • No need for a complex backoff thing. Just remember the failure for 60 seconds, and during those 60 seconds, send back a destroy immediately. Reducing from n attempts per minute down to 1 per minute should be enough to help us survive until the clients get a consensus update and stop asking us to try.

60 seconds seems sensible.
And it's about the timeframe it takes to restart a relay (tor takes 30s to shut down by default).
Anything less would be ineffective.

  • We should avoid having this cached-failure thing impact client behavior. That is, it should cause *other people's* circuits to get destroys, but it shouldn't auto-fail our own client attempts. Maybe we should change how the client behaves, but if so, let's do it later, and not introduce subtle breakage in something we'll be considering for backport.

Excluding origin circuits seems fine.
(So does including them, but only on relays. But we already have backoff and limits on origin circuits.)

  • Hm! I was going to say "but rep_hist_note_connect_failed() won't work if the relay isn't in our consensus", but actually, it is simply based on intended identity digest of the next destination, so it does look like we can reuse the data struct. Except, shouldn't we also be caching the IP:port that failed? Otherwise somebody can ask us to extend to a victim digest at an unworkable IP:port, and we'll cache "failure!" and then refuse all the legit cells going to the right address for the next minute.

When we introduce IPv6 extends from relays (#24404), we only want to fail attempts to a single IP address, and not the whole relay. So dual-stack relays will get two attempts to other dual-stack relays: an IPv4 attempt, and an IPv6 attempt. I like that design, it makes sense to try both.

comment:10 Changed 10 months ago by dgoulet

Status: accepted → needs_review

Ok, after a bit of discussion on IRC with arma this morning, I've ended up with a branch.

This is untested (although it compiles and chutney is happy); there are no unit tests nor proper "in the wild" testing done on it for now.

I just want to throw it here as a full draft because 1) it adds quite a bit of code and 2) I want a review of the code logic before going into review mode.

Branch: bug24767_033_01

comment:11 Changed 10 months ago by asn

(FWIW, this has also been a finding of dawuud from months ago, when he scanned the Tor network for partitions and found that many high-bandwidth relays cannot contact other high-bandwidth relays. I guess he tried a single circuit build between those relays, so basically his findings are isomorphic to the ones from nusenu.)

comment:12 Changed 9 months ago by nickm

Keywords: review-group-32 added

comment:13 Changed 9 months ago by asn

Reviewer: asn

comment:14 Changed 9 months ago by teor

Keywords: 033-must added; must-fix-before-033-stable removed

must-fix-before-033-stable is now 033-must

comment:16 Changed 9 months ago by asn

Status: needs_review → needs_revision

Pushed a unit test in my branch bug24767_033_01; this should provide almost 100% coverage of all the new code.

I also did a review on oniongit.

comment:17 Changed 9 months ago by dgoulet

Status: needs_revision → needs_review

OK, review addressed.

The main commit is still "WIP" so once you ACK this branch, I'll have to squash + amend for the upstream merge (and add a changes file).

Second thing, all this code is currently flat in connection_or.c ... maybe we want it in dos.c or some other appropriate file? No strong opinion.

comment:18 Changed 9 months ago by asn

Fixes LGTM.

The commit at 33628d8 has various typos, so please address those too :)
Also, please don't forget to merge my unit test.

Other than that, LGTM.

comment:19 Changed 9 months ago by dgoulet

Status: needs_review → merge_ready

Thanks! Fixed the typos (I hope all of them!). I've merged your unit test, squashed all the things, rebased on the latest 033, and added a changes file. Putting in merge_ready for nickm's eyes.

Oniongit: https://oniongit.eu/dgoulet/tor/merge_requests/22
Branch: bug24767_033_02

comment:20 Changed 9 months ago by teor

Status: merge_ready → needs_revision

This branch doesn't compile on macOS using clang:

  CC       src/or/connection_or.o
src/or/connection_or.c:1180:23: error: implicit conversion loses integer
      precision: 'uint64_t' (aka 'unsigned long long') to 'unsigned int'
      [-Werror,-Wshorten-64-to-32]
  unsigned int hash = tor_addr_hash(&entry->addr);
               ~~~~   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
src/or/connection_or.c:1187:67: error: extra ';' outside of a function
      [-Werror,-Wextra-semi]
             or_connect_failure_ht_hash, or_connect_failure_ht_eq);
                                                                  ^
src/or/connection_or.c:1191:48: error: extra ';' outside of a function
      [-Werror,-Wextra-semi]
             0.6, tor_reallocarray_, tor_free_);

comment:21 Changed 9 months ago by teor

Reviewer: asn → asn, teor

comment:22 Changed 9 months ago by teor

Status: needs_revision → needs_review

There are two design issues here:

  • the HT implementation only uses 32-bit hashes on 64-bit LP64 systems (like macOS and the BSDs) and on 32-bit systems (split off into #25365)
  • the port is user-controlled, so it needs to be hashed before being combined with the other hashes

I fixed the port issue, and the compilation issues in my branch bug24767_033_02 on https://github.com/teor2345/tor.git

Feel free to squash my extra commits if you'd like.

comment:23 in reply to:  22 ; Changed 9 months ago by dgoulet

Replying to teor:

There are two design issues here:

  • the HT implementation only uses 32-bit hashes on 64-bit LP64 systems (like macOS and the BSDs) and on 32-bit systems (split off into #25365)
  • the port is user-controlled, so it needs to be hashed before being combined with the other hashes

What is the danger here? The length is fixed, so what is the difference between "+= 42" and "+= h(42)"?

I fixed the port issue, and the compilation issues in my branch bug24767_033_02 on https://github.com/teor2345/tor.git

For the hash fix, wouldn't it be more efficient to combine addr + digest + port in a buffer and siphash that, instead of doing 3 rounds of siphash?

comment:24 in reply to:  23 Changed 9 months ago by teor

Replying to dgoulet:

Replying to teor:

There are two design issues here:

  • the HT implementation only uses 32-bit hashes on 64-bit LP64 systems (like macOS and the BSDs) and on 32-bit systems (split off into #25365)
  • the port is user-controlled, so it needs to be hashed before being combined with the other hashes

What is the danger here? The length is fixed, so what is the difference between "+= 42" and "+= h(42)"?

The hash needs to be unpredictable so that bad clients can't fill up one of your hash table buckets and cause your relay to slow down.
Adding the raw port gives the client direct control over 16 bits of your hash result, which makes the hash table less secure.

I fixed the port issue, and the compilation issues in my branch bug24767_033_02 on https://github.com/teor2345/tor.git

For the hash fix, wouldn't it be more efficient to combine addr + digest + port in a buffer and siphash that, instead of doing 3 rounds of siphash?

Hashing digest and port in a single buffer would be an easy win.
Hashing an extra buffer as part of tor_addr_hash() would also be great, because it avoids combining hashes with +.
But it would need a new argument to tor_addr_hash().
And we could fix the other use of tor_addr_hash() at the same time.
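
As a standalone illustration of the single-buffer idea (the keyed_hash() below is a seeded stand-in, only there so the sketch compiles on its own; real code would use SipHash or whatever keyed hash the codebase already provides, and the function names are invented): serialize digest + address bytes + port into one buffer and hash it once, so no attacker-chosen field contributes raw bits to the result.

  #include <stdint.h>
  #include <string.h>

  #define DIGEST_LEN 20

  /* Stand-in for a keyed hash such as SipHash (a seeded FNV-1a), only
   * here so that the sketch is self-contained. */
  static uint64_t
  keyed_hash(uint64_t key, const uint8_t *data, size_t len)
  {
    uint64_t h = key ^ 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
      h ^= data[i];
      h *= 0x100000001b3ULL;
    }
    return h;
  }

  /* Hash one cache key: digest + address bytes + port go into a single
   * buffer which is hashed once, instead of hashing fields separately and
   * adding the raw (attacker-chosen) port on top. addr_len is 4 or 16
   * depending on the address family. */
  uint64_t
  connect_failure_key_hash(uint64_t key, const uint8_t digest[DIGEST_LEN],
                           const uint8_t *addr_bytes, size_t addr_len,
                           uint16_t port)
  {
    uint8_t buf[DIGEST_LEN + 16 + 2];
    size_t off = 0;
    memcpy(buf + off, digest, DIGEST_LEN);    off += DIGEST_LEN;
    memcpy(buf + off, addr_bytes, addr_len);  off += addr_len;
    buf[off++] = (uint8_t)(port >> 8);
    buf[off++] = (uint8_t)(port & 0xff);
    return keyed_hash(key, buf, off);
  }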

comment:25 Changed 9 months ago by dgoulet

Two fixup commits pushed to address teor's review above: the first fixes the hashing and the second the compilation issue.

Same OnionGit
Branch: bug24767_033_02

comment:26 Changed 9 months ago by nickm

Keywords: review-group-34 added

comment:27 Changed 8 months ago by dgoulet

Reviewer: asn, teor → asn

comment:28 Changed 8 months ago by nickm

Keywords: 033-triage-20180320 added

Marking all tickets reached by current round of 033 triage.

comment:29 Changed 8 months ago by nickm

Keywords: 033-included-20180320 added

Mark 033-must tickets as triaged-in for 0.3.3

comment:30 Changed 8 months ago by asn

Status: needs_review → merge_ready

LGTM! :)

comment:31 Changed 8 months ago by nickm

So on the hash function: You can't use a memcpy(tor_addr_t) like that, because some of the bytes of the tor_addr_t can be unspecified. I can fix that up.

On unable_connect_since_ts: Do we ever clear this field (re-set it to zero) on success? It seems like it just stays set forever. If so maybe a better name would be something like "last_failed_connect_ts"?

comment:32 Changed 8 months ago by nickm

Status: merge_ready → needs_revision

comment:33 Changed 8 months ago by nickm

Oh, and one more thing I noticed: It is not enough to just check "node_get_pref_orport(node, &node_addr_port);" -- the node may have multiple addresses, and "pref" just returns the one you would prefer to use yourself.

comment:34 in reply to:  31 ; Changed 8 months ago by dgoulet

Replying to nickm:

So on the hash function: You can't use a memcpy(tor_addr_t) like that, because some of the bytes of the tor_addr_t can be unspecified. I can fix that up.

Oh damn, because of the union: if it is a v4 address, we'll copy a bunch of uninitialized bytes from the larger v6 member. I guess setting the size according to the family is the way to do it.

On unable_connect_since_ts: Do we ever clear this field (re-set it to zero) on success? It seems like it just stays set forever. If so maybe a better name would be something like "last_failed_connect_ts"?

It is a timestamp; we don't need to reset it. It is always compared against a cutoff, so even though it is set once and never touched again, that is fine: at some point it will always pass the cutoff.

Oh, and one more thing I noticed: It is not enough to just check "node_get_pref_orport(node, &node_addr_port);" -- the node may have multiple addresses, and "pref" just returns the one you would prefer to use yourself.

Not sure it is a problem here. We use node_get_pref_orport() only to make sure the OR connection we are about to note down as a failure matches the node_t we have for the identity digest. In that case, we use the node_t to keep the timestamp. If it doesn't match (because the identity has another port/addr), we note it down as "unknown".

This ends up giving the same result, just always indexed more precisely by addr + port + digest.

comment:35 in reply to:  34 Changed 8 months ago by dgoulet

Status: needs_revision → needs_review

Replying to dgoulet:

Replying to nickm:

So on the hash function: You can't use a memcpy(tor_addr_t) like that, because some of the bytes of the tor_addr_t can be unspecified. I can fix that up.

Oh damn, because of the union: if it is a v4 address, we'll copy a bunch of uninitialized bytes from the larger v6 member. I guess setting the size according to the family is the way to do it.

Ok, I made an attempt at this in fixup 53a26acb6172b7eb. Not the prettiest, but I don't see a better way. Maybe we can abstract that into an inline function somehow, like "tor_addr_get_ptr_size()" or something along those lines? (A rough sketch of that idea is below.)

Still branch: bug24767_033_02
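
A standalone sketch of the helper idea suggested above, using a simplified stand-in for tor_addr_t (the real struct and any real helper will differ; the name just follows the hypothetical "tor_addr_get_ptr_size()"): return a pointer and length covering only the bytes that are meaningful for the address family, so a hash or memcpy never reads the uninitialized tail of the union.

  #include <stdint.h>
  #include <stddef.h>

  /* Simplified stand-in for tor_addr_t: a family tag plus a union whose
   * larger (IPv6) member holds uninitialized bytes when the address is
   * actually IPv4. */
  typedef struct sketch_addr_t {
    int family;                       /* 4 or 6 in this sketch */
    union {
      uint8_t v4[4];
      uint8_t v6[16];
    } u;
  } sketch_addr_t;

  /* Return a pointer to, and the length of, only the bytes that are valid
   * for this address's family, so callers never hash or copy the
   * uninitialized tail of the union. */
  const uint8_t *
  sketch_addr_get_ptr_size(const sketch_addr_t *addr, size_t *len_out)
  {
    if (addr->family == 4) {
      *len_out = sizeof(addr->u.v4);
      return addr->u.v4;
    }
    *len_out = sizeof(addr->u.v6);
    return addr->u.v6;
  }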

comment:36 in reply to:  34 Changed 8 months ago by nickm

Replying to dgoulet:

Replying to nickm:

So on the hash function: You can't use a memcpy(tor_addr_t) like that, because some of the bytes of the tor_addr_t can be unspecified. I can fix that up.

Oh damn, because of the union: if it is a v4 address, we'll copy a bunch of uninitialized bytes from the larger v6 member. I guess setting the size according to the family is the way to do it.

On unable_connect_since_ts: Do we ever clear this field (re-set it to zero) on success? It seems like it just stays set forever. If so maybe a better name would be something like "last_failed_connect_ts"?

It is a timestamp; we don't need to reset it. It is always compared against a cutoff, so even though it is set once and never touched again, that is fine: at some point it will always pass the cutoff.

Is it okay with you if I change the name anyway? The problem is that the "since" implies a continuous condition. If I say "I have been unable to eat broccoli since 1982", it does not mean only that in 1982 I failed to eat broccoli: it also means that at every time since 1982, I have also been unable to eat broccoli.

Oh, and one more thing I noticed: It is not enough to just check "node_get_pref_orport(node, &node_addr_port);" -- the node may have multiple addresses, and "pref" just returns the one you would prefer to use yourself.

Not sure it is a problem here. We use node_get_pref_orport() only to make sure the OR connection we are about to note down as a failure matches the node_t we have for the identity digest. In that case, we use the node_t to keep the timestamp. If it doesn't match (because the identity has another port/addr), we note it down as "unknown".

This ends up giving the same result, just always indexed more precisely by addr + port + digest.

But the preferred address can change while Tor is running, I think, based on configuration options. That would make the value cached in the node become incorrect, and that's why I think we should maybe just not have this in the node_t at all.

comment:37 Changed 8 months ago by nickm

53a26acb6172b7eb looks okay to me. I'll write another patch to simplify it after this is merged.

comment:38 Changed 8 months ago by dgoulet

Is it okay with you if I change the name anyway? The problem is that the "since" implies a continuous condition. If I say "I have been unable to eat broccoli since 1982", it does not mean only that in 1982 I failed to eat broccoli: it also means that at every time since 1982, I have also been unable to eat broccoli.

Renamed unable_connect_since_ts in commit 0410a5e49c.

But the preferred address can change while Tor is running, I think, based on configuration options. That would make the value cached in the node become incorrect, and that's why I think we should maybe just not have this in the node_t at all.

If a node_t's preferred address changes, indeed we won't match the connection to a node_t anymore, and thus we will fall back to using the cache for unknown nodes. That object in the cache will then be used until the node_t gets modified, I guess by a new descriptor at some point. So all in all, we'll still be tracking connection failures correctly.

The only reason I'm still hesitating a bit about using the "unknown cache" for everything is performance. If we used it for everything, then on *every* connection we would need to do a lookup whose hash function (over the addr + port + digest triplet) is a bit heavier than simply looking up the node list. Furthermore, every 60 seconds (OR_CONNECT_FAILURE_CLEANUP_INTERVAL) we clean up that cache, which means iterating over thousands of objects in there (potentially more than the number of relays); a sketch of that walk is at the end of this comment. I don't think it would be a significant cost, but it is still important to keep in mind. Keeping the cache tiny can be worth it.

However, using only the cache (and not the node_t) would make the code a bit simpler; we would just need to accept the additional memory and lookup/iteration cost.

My testing on a real relay shows that the majority of the time, we'll get our information out of the node_t.

At this point I could lean either way; I might just need a second opinion?
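
For the cleanup cost mentioned above, a standalone sketch (a plain array instead of a hash table, with made-up names) of the periodic walk: every cleanup interval, every entry is visited and the expired ones are dropped, which is why the size of the cache matters even when individual lookups are cheap.

  #include <stddef.h>
  #include <time.h>

  /* Mirrors the role of OR_CONNECT_FAILURE_CLEANUP_INTERVAL: how long a
   * recorded failure stays relevant before the walk below removes it. */
  #define FAILURE_LIFETIME 60   /* seconds */

  typedef struct cache_entry_t {
    time_t last_failed_at;
    int in_use;
  } cache_entry_t;

  /* Visit every entry and drop the ones older than the lifetime. In the
   * real code this is a hash-table iteration; the point is that the walk
   * touches every entry, potentially more entries than there are relays. */
  size_t
  cleanup_expired(cache_entry_t *entries, size_t n, time_t now)
  {
    size_t removed = 0;
    for (size_t i = 0; i < n; i++) {
      if (entries[i].in_use &&
          (now - entries[i].last_failed_at) >= FAILURE_LIFETIME) {
        entries[i].in_use = 0;
        removed++;
      }
    }
    return removed;
  }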

comment:39 Changed 8 months ago by dgoulet

Status: needs_review → needs_revision

Discussed this with nickm on IRC. We'll move this to use only the cache, instead of the node_t, to store the timestamp.

We'll also rename the "unknown cache" to something more meaningful like "node_addr_track_cache", or something that tells us we are actually tracking failures.

comment:40 Changed 8 months ago by dgoulet

Status: needs_revision → needs_review

I've created a _03 branch which squashes everything, because the latest changes basically modified half the patch. The code should be simpler and more straightforward to understand. The use of the node_t has been completely removed.

PR: https://github.com/torproject/tor/pull/29
Branch: bug24767_033_03

comment:41 Changed 8 months ago by asn

Status: needs_review → merge_ready

Much better!

On a meta level, should we open a ticket about documenting this behavior in our spec? Also, would it be better in the future to switch from the static 60-second lifetime to some sort of backoff algorithm, or to a consensus param? Should we open a ticket for this?

comment:42 Changed 8 months ago by nickm

Okay; much simpler. I've merged it to maint-0.3.3 and forward. Let's close this once there are follow-up tickets for the issues that asn raises in comment 41 above?

comment:43 in reply to:  41 Changed 8 months ago by dgoulet

Replying to asn:

Much better!

On a meta level, should we open a ticket about documenting this behavior in our spec? Also, would it be better in the future to switch from the static 60-second lifetime to some sort of backoff algorithm, or to a consensus param? Should we open a ticket for this?

Yes! I think we could do much better in a second iteration. A backoff of some sort with knobs we can change through the consensus.
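
One possible shape for that second iteration, as a standalone sketch only (the knob names and values below are invented for illustration and are not real consensus parameters or code from any branch): let the consensus supply a base lifetime and a cap, and double the cached-failure lifetime on each consecutive failure to the same relay.

  #include <stdint.h>

  /* Invented knobs standing in for values a consensus param could supply. */
  #define FAILURE_LIFETIME_BASE 60      /* seconds, first failure */
  #define FAILURE_LIFETIME_MAX  3600    /* cap */

  /* Double the cached-failure lifetime with each consecutive failure to
   * the same relay, up to the cap. */
  unsigned
  failure_lifetime_for(unsigned consecutive_failures)
  {
    uint64_t lifetime = FAILURE_LIFETIME_BASE;
    while (consecutive_failures > 1) {
      lifetime *= 2;
      if (lifetime >= FAILURE_LIFETIME_MAX)
        return FAILURE_LIFETIME_MAX;
      consecutive_failures--;
    }
    return (unsigned) lifetime;
  }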

comment:44 Changed 8 months ago by dgoulet

Resolution: fixed
Status: merge_ready → closed

Ticket #25674 for the possible improvement. Closing this.
