Opened 9 months ago

Closed 7 months ago

Last modified 7 months ago

#24902 closed enhancement (fixed)

Denial of Service mitigation subsystem

Reported by: dgoulet Owned by: dgoulet
Priority: Very High Milestone: Tor: 0.3.3.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-dos, tor-relay, review-group-30, 029-backport, 031-backport, 032-backport, review-group-31, SponsorV
Cc: Actual Points:
Parent ID: Points:
Reviewer: arma Sponsor:

Description

This ticket is for adding a denial of service mitigation subsystem to tor.

Because of the latest issues we've been having on the network with 1 million users most likely resulting in the huge loads on the relays we've been seeing, this subsystem is to provide a framework in order to add defense to tor for potential (voluntary or not) denial of service.

Child Tickets

Ticket   Type    Status  Owner    Summary
#24898   defect  closed           We have two conflicting notions of channel_is_client()
#25094   defect  closed           24902 fix breaks on clang
#25128   defect  closed  dgoulet  Bug: src/or/dos.c:312: cc_stats_refill_bucket: Non-fatal assertion new_circuit_bucket_count >= stats->circuit_bucket failed
#25183   defect  closed  dgoulet  Implement a way to tell if an IP address is a known relay
#25193   defect  closed  dgoulet  dos: Avoid blacklisting Exit relays
#25202   defect  closed  dgoulet  Check the calculations in cc_stats_refill_bucket using non fatal assertions
#25223   defect  closed  dgoulet  dos: dos_new_client_conn: Non-fatal assertion !(entry == NULL) failed
#25236   defect  closed  dgoulet  dos: Document torrc default values in the man page when not in the consensus

Change History (86)

comment:1 Changed 9 months ago by dgoulet

Status: assigned → needs_review

I'm submitting today, for initial review, a branch that contains a basic skeleton of this subsystem and a DoS circuit creation mitigation feature.

This is a draft of the design and I hope to get it in a MUCH BETTER format to put in doc/ at some point if we want this:

https://people.torproject.org/~dgoulet/volatile/ddos-design.txt

A few things to mention.

First, there are no unit tests because before doing so I wanted more opinions on the design, engineering and overall structure of the code.

Second, this code has been running on my relay for ~4 days, during which more than 330 IPs have been identified as malicious and have their cells dropped (which is the defense in place).

Third, there could still be an issue with client traffic going through an Exit and back into the network; we need to address this, or at least mitigate it as much as we can, before we deploy.

Fourth, this feature is disabled by default and I would expect that in normal circumstances, it won't be used at all. I see this as a way to help out in situations like the one we are in right now.

Last thing: there is another possible mitigation regarding the high number of concurrent TCP connections doing tor2web. We are seeing that at a high rate right now on the network (most likely scanning the "DarkWeb"). This branch is *NOT* about that, but a detection/defense for it could take advantage of this code in many ways.

See branch: dgoulet/ddos_033_03

comment:2 Changed 9 months ago by cypherpunks

This seems like it may also highly stress or kill off relays with old Tor versions, when the DDoSers change their guard (due to this patch) and the load eventually settles at some relay with an old Tor version.

comment:3 Changed 9 months ago by teor

As I suggested privately, I believe the best defense against tor traffic via an exit is to count unauthenticated (client, bridge, onion service) and authenticated (public relay) connections separately.

comment:4 in reply to:  2 Changed 9 months ago by dgoulet

Replying to cypherpunks:

This seems like it may also highly stress or kill off relays with old Tor versions, when the DDoSers change their guard (due to this patch) and the load eventually settles at some relay with an old Tor version.

Yes, that is one of the worries I have. However, this circuit creation mitigation defense silently drops cells on a created circuit. In other words, clients will open circuits on the Guard and the Guard returns CREATED as a response, so the client thinks it is valid and thus sends a bunch of cells that are silently dropped by the Guard at that point.

I believe this makes the client not switch Guard and just keep sending stuff to the void. So the big Guard will soak up the load instead of spreading it out.

Not perfect but a first step towards better defense.

comment:5 in reply to:  3 Changed 9 months ago by dgoulet

Replying to teor:

As I suggested privately, I believe the best defense against tor traffic via an exit is to count unauthenticated (client, bridge, onion service) and authenticated (public relay) connections separately.

Yes indeed, that part is missing. I'm not entirely sure why we should track connections independently here; this DoS mitigation only tracks client connections.

So basically, for this extra "Exit detection" protection, I think we could check if it is a known relay digest, and maybe also check if we have a matching non-client channel for the address. What do you think?

comment:6 Changed 9 months ago by nickm

Keywords: review-group-30 added

comment:7 Changed 9 months ago by teor

It is probably ok to ignore relay connections for now, but if we ever get a DDoS via relays or by unpublished relays, we will be sorry and wish we had counted relays and clients separately.

comment:8 Changed 9 months ago by dgoulet

I've added the relay detection that was discussed with teor. Also, I've simplified things a bit after a discussion with asn on IRC: instead of using the circuit delta and circuit timeout value, we use a fixed maximum circuit count, which can be computed beforehand in the consensus instead of at the relay. And as per that discussion, I've also added a cap on the number of concurrent connections used in the circuit threshold equation. See top commit.

New branch name to reflect this ticket: ticket24902_033_01
OnionGit: https://oniongit.eu/dgoulet/tor/merge_requests/15

comment:9 Changed 9 months ago by nickm

Oh dear -- does this really have to be 0.3.3 only? Can this rebase cleanly back on to an older branch, so that we know how hard the backport will be?

Additionally, I'm going to start reviewing the branch on oniongit too

comment:10 Changed 9 months ago by nickm

I've made some initial comments on the oniongit branch; I see that others have been discussing too. This needs unit tests to be mergeable.

comment:11 in reply to:  9 Changed 9 months ago by dgoulet

Status: needs_review → needs_revision

Replying to nickm:

Oh dear -- does this really have to be 0.3.3 only? Can this rebase cleanly back on to an older branch, so that we know how hard the backport will be?

Hmmm, I didn't think this would be a version we'd consider backporting to, nor how far back we would want to backport.

Very little touches the current code base; most of it is in its own file, so we should be able to backport this properly, but it will require different branches for each backported version, I believe.

Additionally, I'm going to start reviewing the branch on oniongit too

Thanks! I'll look at it asap.

comment:12 Changed 9 months ago by nickm

Very little touches the current code base; most of it is in its own file, so we should be able to backport this properly, but it will require different branches for each backported version, I believe.

Actually, it might not. Often we can just write one branch against maint-0.2.9, and merge it forward into all the other branches.

comment:13 Changed 9 months ago by asn

OK I did a basic review of the code and the design.

I think the current code complexity stems from the slot/bucket design, and splitting the time periods into slots, marking them, and assessing the circuits based on slots. I think without the slot system the logic could be as simple as:

-> for every new circuit of this IP, nr_of_circuits++;
-> for every new conn of this ip, nr_of_conns++;
-> every N seconds, reset nr_of_circuits for this IP.
if (nr_of_conns > conn_magic_number || nr_of_circuits > circ_magic_number) {
   return DROP;
}
return GOOD;

I understand that the slot design can eventually allow us to even block attackers with a single connection while allowing normal clients to do circuit bursts, but I'm questioning whether the complexity is worth it. Furthermore, it's possible that the slot system can be exploited by attackers, by really going all out during some 30-second slots, and staying more chill for the rest of them, and still getting a pass for the entire time period.

If we kill the slot system, we will get a very simple system but it will be less versatile, and we will need to have a bigger magic_number to be able to keep our false positives at a reasonable rate.

Last edited 9 months ago by asn

comment:14 Changed 9 months ago by dgoulet

Status: needs_revision → accepted

Moving this back to "accepted" since a lot will change after IRC discussions. The new and hopefully simpler design is this now:

  1. Have a per-IP circuit token bucket which is refilled by some amount at some rate, both defined by consensus parameters. Remove a token from the bucket every time a CREATE is seen. If the bucket goes down to 0, activate the defense if the number of concurrent connections is above a certain threshold defined by a consensus parameter. (A rough sketch of this is at the end of this comment.)
  1. Detect a high number of concurrent connections per IP and start closing connections for that IP if it reaches a threshold specified by a consensus parameter.
  1. Add a torrc option and/or consensus parameter to refuse ESTABLISH_RENDEZVOUS cells from client connections, or in other words, an anti-tor2web option at the relay. These have been observed to be quite problematic, as people are running hundreds (if not thousands) of tor2web clients scanning the onion space. As collateral damage, this loads relays with connections for rendezvous circuits. We could easily integrate that option with a certain threshold of parallel connections, like "if I see 10 connections on that IP doing RDV, block".

I'm working on the new code for this.
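
A minimal standalone sketch of the per-IP token bucket from item 1 above; the names, types, and hard-coded values here are illustrative stand-ins for the consensus parameters, not the actual dos.c code:

#include <stdint.h>
#include <time.h>

/* Illustrative per-address state; the real code would keep this in the
 * per-client (geoip) cache entry rather than in a standalone struct. */
typedef struct {
  uint32_t circuit_bucket;    /* tokens left; 0 means positive detection */
  time_t last_refill_ts;      /* last time the bucket was refilled */
  unsigned concurrent_conns;  /* currently open connections from this IP */
} dos_client_stats_example_t;

/* Example values standing in for the consensus parameters. */
#define CIRCUIT_BURST_EXAMPLE 90        /* bucket size */
#define CIRCUIT_RATE_EXAMPLE 3          /* tokens added per second */
#define MIN_CONCURRENT_CONN_EXAMPLE 3   /* defense needs this many conns */

/* Refill the bucket based on elapsed time, capped at the burst size. */
static void
refill_bucket_example(dos_client_stats_example_t *st, time_t now)
{
  time_t elapsed = now - st->last_refill_ts;
  if (elapsed <= 0)
    return;
  uint64_t count = st->circuit_bucket +
                   (uint64_t)elapsed * CIRCUIT_RATE_EXAMPLE;
  st->circuit_bucket =
    count > CIRCUIT_BURST_EXAMPLE ? CIRCUIT_BURST_EXAMPLE : (uint32_t)count;
  st->last_refill_ts = now;
}

/* Called for every CREATE cell seen from this address; returns 1 if the
 * circuit-creation defense should be applied. */
static int
seen_create_cell_example(dos_client_stats_example_t *st, time_t now)
{
  refill_bucket_example(st, now);
  if (st->circuit_bucket > 0)
    st->circuit_bucket--;
  return st->circuit_bucket == 0 &&
         st->concurrent_conns >= MIN_CONCURRENT_CONN_EXAMPLE;
}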

comment:15 Changed 9 months ago by dgoulet

Keywords: 029-backport 031-backport 032-backport added
Priority: Medium → Very High
Status: accepted → needs_review

Hello again world!

Ok, some code is ready implementing the above. It differs greatly from the previous branch. Most of it has been simplified. This branch also now implements the 3 detections mentioned above, which are (1) circuit creation DoS, (2) concurrent connection DoS, and (3) ESTABLISH_RENDEZVOUS from client (tor2web) DoS.

Some stuff you might find useful to know before you dive in:

  • Each detection (listed above) has, for now, only a single type of defense implemented, so if we can think of more that we might want short term, now is a good time to get them in. Defenses can be selected by a consensus parameter.
  • For defense (3), I've gone quite explicitly with a torrc option (also controlled by a consensus param) to refuse tor2web client connections. There is no threshold or anything for now because, frankly, I think tor2web clients are hurting us more than anything else through their ability to directly connect to all relays and thus "naturally" put resource pressure on all of them...
  • This code uses the geoip client cache, which seems fine but has an interesting quirk: after 24h, an entry is wiped out, which means we lose all the DoS mitigation statistics for that entry. Not too bad, because if the client address is still DoSing, it will be detected again and blocked. It wouldn't be too hard to avoid that, but it would require a bit more code/thinking to clean the cache up so it doesn't grow infinitely.
  • The branch you are about to review is based on 029, considering a possible backport. If you want to test this on a relay that was previously >= 0.3.1, know that your client won't connect to it until you get a new consensus, because it will be expecting an ed25519 key, which is not used in 029. Either use an older client or merge forward to latest master by resolving the few conflicts there will be :).
  • All the DoS detections and defenses are disabled by default. They require a consensus param to be set in order to be enabled. So, if you want to test this on a relay, go into src/or/dos.h and change the enabled default values in there for both the CC and CONN defenses.
  • A log line has been added to the heartbeat, so if it works, you should see a line like this (a real one!):
Jan 21 16:32:57.647 [notice] DoS mitigation since startup: 459085 cells rejected, 128 marked address. 235 MB have been dropped. 20126 connection rejected. 200 tor2web client refused.

Let us bikeshed on names here if you don't like them, and please propose alternatives, because this is currently the best I can come up with using my "non-English-been-a-month-on-DoS-land brain" :).

Branch: ticket24902_029_01
Oniongit: https://oniongit.eu/dgoulet/tor/merge_requests/16

comment:16 Changed 9 months ago by teor

I think we should add two more Tor2web defences managed by a consensus parameter:

  • when an introduce cell is sent direct from a client, drop that cell and any extend requests
    • this is really important because it delays Tor2web introductions and failed introduction extends
  • drop HSDir lookups where the circuit came directly from a client

I think we should wait a release or two to turn the introduce and HSDir ones on.
But if it gets really bad, and we backport them to 0.2.9, maybe we can turn them on sooner.

I also think that Tor2web combined with single onion services makes a DDoS much more likely.
Neither end has any guards, and they both make single hop connections.
And we're not defending against that at all right now.

When the service side is a directly connected client (single onion service):

  • we should automatically activate the introduce defence
    • this is very effective, because it stops Tor2web straight away
  • we should automatically activate the rendezvous defence (drop all cells) as soon as the service connects
    • this is not very effective, because the rendezvous has already been established, but it's important for security

comment:17 Changed 9 months ago by dgoulet

I've gone over Roger's review in the oniongit. Some discussions are left to be answered.

asn will soon hand off to me a unittests branch (very awesome) so expect that at some point, I'll take over and put it in as an extra commit.

I think we should add two more Tor2web defenses managed by a consensus parameter:

Thanks teor for this, I 100% agree with you. What I'm wondering here is if we should take the time to also implement these and backport them, or, for now, only put in the RP one (which I think is the worst one, because clients open the RP before doing the introduction) and put the others in 034+? If the latter, I propose we open a new ticket for this "anti DoS + tor2web" issue, because at that point, if we end up with relays just denying direct client connections for HS purposes, we should start strongly considering ripping the tor2web code out of Tor. I won't start a "why do that" discussion in this ticket.

comment:18 in reply to:  17 Changed 9 months ago by teor

Replying to dgoulet:

...

I think we should add two more Tor2web defenses managed by a consensus parameter:

Thanks teor for this, I 100% agree with you. What I'm wondering here is if we should take the time to also implement these and backport them, or, for now, only put in the RP one (which I think is the worst one, because clients open the RP before doing the introduction) and put the others in 034+? If the latter, I propose we open a new ticket for this "anti DoS + tor2web" issue, because at that point, if we end up with relays just denying direct client connections for HS purposes, we should start strongly considering ripping the tor2web code out of Tor. I won't start a "why do that" discussion in this ticket.

Do we know if the extra load is bringing down HSDirs?
(The fetch creates more load, but HS descriptors are cached by clients.)

Let's open separate tickets for 0.3.4 for blocking Tor2web HSDir and Intro. And we should think about backporting the HSDir defence, because we will want it if the load gets worse.

We might also want to block single onion / Tor2web intros and rendezvous by default, and backport the code for security. The existing tickets are #22688 and #22689.

comment:19 in reply to:  18 Changed 9 months ago by dgoulet

Replying to teor:

Do we know if the extra load is bringing down HSDirs?
(The fetch creates more load, but HS descriptors are cached by clients.)

Can't say but it is possible yes.

Let's open separate tickets for 0.3.4 for blocking Tor2web HSDir and Intro. And we should think about backporting the HSDir defence, because we will want it if the load gets worse.

We might also want to block single onion / Tor2web intros and rendezvous by default, and backport the code for security. The existing tickets are #22688 and #22689.

I've created the high level parent ticket for all this tor2web mess in #24962. I've then created a ticket for HSDir and Intro (#24963 and #24964 respectively).

I've made #17945 a child of this #24962 parent ticket as well.

Let's put everything we've got that is related to "Stop tor2web directly connecting to relays" into that parent ticket.

comment:20 Changed 9 months ago by dgoulet

Code update. I've addressed most of Nick's review. There are a few things we still need to figure out in there.

An _02 branch has been created which squashes the gazillion fixup commits in the _01 branch. The git diff is empty so that is that.

I've pushed two new commits on both branches obviously. One that makes us send back the DESTROY cell right after we get the CREATE instead of "at the next cell on that circuit".

Second commit is unit tests that asn did that I've ported.

The _01 merge request is still alive and kicking for discussions:
https://oniongit.eu/dgoulet/tor/merge_requests/16
Branch: ticket24902_029_01

The _02 merge request, if you prefer to review a fresh branch:
https://oniongit.eu/dgoulet/tor/merge_requests/17
Branch: ticket24902_029_02

comment:21 Changed 9 months ago by nickm

We need to test this DESTROY change to make sure it still works the way we think. I'll go over the old branch and the new commits, as well as the new branch, later on tonight. :)

comment:22 in reply to:  21 Changed 9 months ago by dgoulet

Replying to nickm:

We need to test this DESTROY change to make sure it still works the way we think. I'll go over the old branch and the new commits, as well as the new branch, later on tonight. :)

It's been running for more than 40 minutes on my relay. So far so good, it appears. Here is the heartbeat after 30 minutes of running:

Jan 22 20:42:04.863 [notice] DoS mitigation since startup: 286298 cells rejected, 127 marked address. 147 MB have been dropped. 15692 connection rejected. 3585 tor2web client refused.

What I'm wondering is if the fact that we don't send CREATED back makes the client switch Guard faster?

comment:23 Changed 9 months ago by dgoulet

Just did an experiment with a tor2web client with my relay pinned as the RP. Because it is the only RP available, the tor client happily retries *non-stop* to connect to a working RP until the SocksTimeout, which is 120 seconds by default. In this case, it creates many circuits to the same RP.

I think Roger opened a ticket recently about making tor client try to remember failing relays for specific things.

So just keep in mind that we'll have a _lot_ of tor2web clients trying every relay on the network if this defense becomes a thing everywhere.

comment:24 in reply to:  23 Changed 9 months ago by teor

What I'm wondering is if the fact that we don't send CREATED back makes the client switch Guard faster?

Tor2web clients don't use guards for any HS circuits. They're all direct connections to a randomly selected rend point. With no bandwidth weighting.

Replying to dgoulet:

Just did an experiment with a tor2web client with my relay pinned as the RP. Because it is the only RP available, the tor client happily retries *non-stop* to connect to a working RP until the SocksTimeout, which is 120 seconds by default. In this case, it creates many circuits to the same RP.

I think Roger opened a ticket recently about making tor client try to remember failing relays for specific things.

So just keep in mind that we'll have a _lot_ of tor2web clients trying every relay on the network if this defense becomes a thing everywhere.

Maybe avoiding the CREATED cell is not the best defence.

Reading circuit_expire_building(), Tor2web should time out on a rendezvous after 15 seconds:

SET_CUTOFF(stream_cutoff, MAX(options->CircuitStreamTimeout,15)*1000 + 1000);

We need to find a Tor2web defence that triggers some kind of delay on the client side. Our options are:

  • close the connection immediately (no delay?)
  • ignore all cells (min 500ms, initial 1500ms or cbtmintimeout consensus parameter)
  • ignore CREATE cells (min 500ms, initial 1500ms or cbtmintimeout consensus parameter)
  • ignore ESTABLISH_RENDEZVOUS cells (min 15s)

I think we should send back CREATED, and ignore the ESTABLISH_RENDEZVOUS, because that gets us a guaranteed minimum 15 second timeout.
We could increase the cbtmintimeout consensus parameter to a really high value. (Which seemed to work well on my relays.) But the client's timeout would only stay high if almost all relays delayed almost all circuits created by these clients.

comment:25 in reply to:  24 Changed 9 months ago by dgoulet

Replying to teor:

What I'm wondering is if the fact that we don't send CREATED back makes the client switch Guard faster?

Tor2web clients don't use guards for any HS circuits. They're all direct connections to a randomly selected rend point. With no bandwidth weighting.

This was not about tor2web clients. The largest part of the DoS is actual regular clients making large numbers of circuits, and thus EXTEND2 cells arrive at the Guard in huge numbers. Previously, we would reject and close the circuit once we saw a first cell. Now, this branch doesn't send back CREATED.

Replying to dgoulet:

Just did an experiment with a tor2web client with my relay pinned as the RP. Because it is the only RP available, the tor client happily retries *non-stop* to connect to a working RP until the SocksTimeout, which is 120 seconds by default. In this case, it creates many circuits to the same RP.

I think Roger opened a ticket recently about making tor client try to remember failing relays for specific things.

So just keep in mind that we'll have a _lot_ of tor2web clients trying every relay on the network if this defense becomes a thing everywhere.

Maybe avoiding the CREATED cell is not the best defence.

For the tor2web defense, right now it is refusing ESTABLISH_RENDEZVOUS from client connections and closing the circuit.

Reading circuit_expire_building(), Tor2web should time out on a rendezvous after 15 seconds:

SET_CUTOFF(stream_cutoff, MAX(options->CircuitStreamTimeout,15)*1000 + 1000);

We need to find a Tor2web defence that triggers some kind of delay on the client side. Our options are:

  • close the connection immediately (no delay?)

This is what happens when the ESTABLISH_RDV cell is seen.

  • ignore all cells (min 500ms, initial 1500ms or cbtmintimeout consensus parameter)
  • ignore CREATE cells (min 500ms, initial 1500ms or cbtmintimeout consensus parameter)
  • ignore ESTABLISH_RENDEZVOUS cells (min 15s)

I think we should send back CREATED, and ignore the ESTABLISH_RENDEZVOUS, because that gets us a guaranteed minimum 15 second timeout.

Hmmm, so dropping the cell instead of sending back a DESTROY. Sounds promising! And if these come in large numbers of connections, the second defense, the "max concurrent connections" one, will kick in.

comment:26 in reply to:  15 Changed 9 months ago by arma

Replying to dgoulet:

  • This code uses the geoip client cache, which seems fine but has an interesting quirk: after 24h, an entry is wiped out, which means we lose all the DoS mitigation statistics for that entry.

Is this another way that the concurrent conn count could get out of sync?

comment:27 in reply to:  18 Changed 9 months ago by arma

Replying to teor:

Do we know if the extra load is bringing down HSDirs?
(The fetch creates more load, but HS descriptors are cached by clients.)

I heard from one of the affected onion services that some of their HSDirs tend to go offline. So I'm going to say "yes it does happen".

comment:28 in reply to:  24 Changed 9 months ago by arma

Replying to teor:

Tor2web clients don't use guards for any HS circuits. They're all direct connections to a randomly selected rend point. With no bandwidth weighting.

For the historical record, they do use bandwidth weighting, except in the case where they additionally set the Tor2webRendezvousPoints config option (which I assume few of them do).

comment:29 in reply to:  24 Changed 9 months ago by arma

Replying to teor:

  • ignore ESTABLISH_RENDEZVOUS cells (min 15s)

I think we should send back CREATED, and ignore the ESTABLISH_RENDEZVOUS, because that gets us a guaranteed minimum 15 second timeout.

Agreed. That's what I've been doing on my hacked-up relay:

@@ -249,6 +251,14 @@ rend_mid_establish_rendezvous(or_circuit_t *circ, const uint8_t *request,                                                              
     goto err;                                                                  
   }                                                                            
                                                                                
+  if (channel_is_client(circ->p_chan)) {                                       
+    log_info(LD_REND,
+      "DEFENSE: dropped ESTABLISH_RENDEZVOUS on circuit %u, prev IP %s",
+      (unsigned)circ->p_circ_id,
+      channel_get_actual_remote_descr(circ->p_chan));
+    return 0; // quietly drop it, and let it time out
+  }
+
   /* Acknowledge the request. */
   if (relay_send_command_from_edge(0,TO_CIRCUIT(circ),
                                    RELAY_COMMAND_RENDEZVOUS_ESTABLISHED,

and I think it's a good choice here too.

We could increase the cbtmintimeout consensus parameter to a really high value. (Which seemed to work well on my relays.) But the client's timeout would only stay high if almost all relays delayed almost all circuits created by these clients.

No, I think the only way to get a higher timeout for establish-rendezvous attempts is if the user manually set their options->CircuitStreamTimeout. The code as you say is

  /* CIRCUIT_PURPOSE_C_ESTABLISH_REND behaves more like a RELAY cell.
   * Use the stream cutoff (more or less). */
  SET_CUTOFF(stream_cutoff, MAX(options->CircuitStreamTimeout,15)*1000 + 1000);

which does not reference get_circuit_build_timeout_ms(). :(

comment:30 in reply to:  29 Changed 9 months ago by teor

Replying to arma:

Replying to teor:

We could increase the cbtmintimeout consensus parameter to a really high value. (Which seemed to work well on my relays.) But the client's timeout would only stay high if almost all relays delayed almost all circuits created by these clients.

No, I think the only way to get a higher timeout for establish-rendezvous attempts is if the user manually set their options->CircuitStreamTimeout. The code as you say is

  /* CIRCUIT_PURPOSE_C_ESTABLISH_REND behaves more like a RELAY cell.
   * Use the stream cutoff (more or less). */
  SET_CUTOFF(stream_cutoff, MAX(options->CircuitStreamTimeout,15)*1000 + 1000);

which does not reference get_circuit_build_timeout_ms(). :(

I was talking about dropping other types of cells earlier in circuit construction. Those purposes reference get_circuit_build_timeout_ms().

comment:31 Changed 9 months ago by dgoulet

LOTS of fixes went in.

See _01 branch for all the fixups and new commits.

The _02 branch is squashed with all the things.

Now an _03 branch exists so that we can have a nice new merge request without breaking the _02 one on Oniongit: https://oniongit.eu/dgoulet/tor/merge_requests/18

comment:32 Changed 9 months ago by asn

Please check branch ticket24902_refill_test in my repo for another unittest, testing the token bucket refill procedure! Cheers!

comment:33 Changed 9 months ago by dgoulet

Branch _01 has all the fixups and new commit from the review.

Branch _03 is squashed with the latest.

Everything includes asn's latest unit tests as well. Might be a bit difficult to follow the latest because a lot has been fixed and added so let me know how I can make this as smooth as possible.

comment:34 Changed 9 months ago by dgoulet

The branches were becoming insane in terms of commit logic and straight usefulness. I've created a brand new clean _04 branch that has fewer commits, better separation and logic. We should be at an almost final version here.

Branch: ticket24902_029_04
OnionGit: https://oniongit.eu/dgoulet/tor/merge_requests/19

The git diff with _01/_03 is one single include that has been added in config.c for completeness.

comment:35 Changed 9 months ago by nickm

Keywords: review-group-31 added

comment:36 Changed 9 months ago by arma

I've been running dgoulet's ticket24902_029_04 with the following additional patch to make channel_is_client() not be a lie:

diff --git a/src/or/channeltls.c b/src/or/channeltls.c
index 09cca95..7ad8a8c 100644
--- a/src/or/channeltls.c
+++ b/src/or/channeltls.c
@@ -1640,6 +1640,7 @@ channel_tls_process_netinfo_cell(cell_t *cell, channel_tls
         tor_assert(tor_digest_is_zero(
                   (const char*)(chan->conn->handshake_state->
                       authenticated_peer_id)));
+        channel_mark_client(TLS_CHAN_TO_BASE(chan));
         channel_set_circid_type(TLS_CHAN_TO_BASE(chan), NULL,
                chan->conn->link_proto < MIN_LINK_PROTO_FOR_WIDE_CIRC_IDS);
 
diff --git a/src/or/connection_or.c b/src/or/connection_or.c
index dadfdc4..8beedca 100644
--- a/src/or/connection_or.c
+++ b/src/or/connection_or.c
@@ -1880,6 +1880,12 @@ connection_or_set_state_open(or_connection_t *conn)
   connection_or_change_state(conn, OR_CONN_STATE_OPEN);
   control_event_or_conn_status(conn, OR_CONN_EVENT_CONNECTED, 0);
 
+  /* Link protocol 3 appeared in Tor 0.2.3.6-alpha, so any connection
+   * that uses an earlier link protocol should not be treated as a relay. */
+  if (conn->link_proto < 3) {
+    channel_mark_client(TLS_CHAN_TO_BASE(conn->chan));
+  }
+
   or_handshake_state_free(conn->handshake_state);
   conn->handshake_state = NULL;
   connection_start_reading(TO_CONN(conn));

comment:37 Changed 9 months ago by arma

nickm set this one to a review-group. Shouldn't we be looking at an 033-oriented branch though, since that's what we'll merge first? Or is the plan to merge the whole thing straight into 029 and then forward, and then put out an 033 with it, and hope that we don't need a new 029 before we've tested it enough? :)

comment:38 in reply to:  37 Changed 9 months ago by teor

Replying to arma:

I've been running dgoulet's ticket24902_029_04 with the following additional patch to make channel_is_client() not be a lie:

Is this extra patch any different from #24898?
Because if it's not, we should rebase this branch on the 0.2.9 backport.
For details, see https://trac.torproject.org/projects/tor/ticket/24898#comment:10
Otherwise we won't be testing what we actually plan on deploying.

Replying to arma:

nickm set this one to a review-group. Shouldn't we be looking at an 033-oriented branch though, since that's what we'll merge first? Or is the plan to merge the whole thing straight into 029 and then forward, and then put out an 033 with it, and hope that we don't need a new 029 before we've tested it enough? :)

We can merge it to 0.3.3 soon, and then merge it to 0.2.9 later. Backport merges don't cause any conflicts as long as the commits are the same.

comment:39 in reply to:  38 Changed 9 months ago by arma

Replying to teor:

Replying to arma:

with the following additional patch to make channel_is_client() not be a lie:

Is this extra patch any different from #24898?
Because if it's not, we should rebase this branch on the 0.2.9 backport.
For details, see https://trac.torproject.org/projects/tor/ticket/24898#comment:10
Otherwise we won't be testing what we actually plan on deploying.

Correct, it is essentially those two changes listed in #24898. The commits you point to there have a bit more comments around the new lines, but the code is the same.

(I decided on those two things as the required 0.2.9 changes independently from #24898, so that's a good sign.)

comment:40 Changed 9 months ago by arma

If people are into writing more unit tests, 58789c1 is a great commit for adding them: it is a potentially tricky operation on our connection list that will happen very rarely in deployed relays, so if it is buggy it's going to be a long time until we find out (and potentially a lot longer until we can narrow a bug down to this commit).

comment:41 Changed 9 months ago by arma

+             "DoS mitigation since startup:%s%s%s",

We should make a decision here about whether heartbeat info should be "since last heartbeat" or "since startup". I think we have a mixture of it right now:

Jan 29 01:48:16.972 [notice] Heartbeat: Tor's uptime is 1 day 11:59 hours, with 207915 circuits open. I've sent 4020.40 GB and received 4106.65 GB.
Jan 29 01:48:16.972 [notice] Circuit handshake stats since last time: 1456490/1456490 TAP, 85381513/85381513 NTor.
Jan 29 01:48:16.972 [notice] Since startup, we have initiated 0 v1 connections, 0 v2 connections, 8 v3 connections, and 34886 v4 connections; and received 457 v1 connections, 41063 v2 connections, 78264 v3 connections, and 657961 v4 connections.
Jan 29 01:48:16.972 [notice] DoS mitigation since startup: 5094873 cells rejected, 40 marked address. 2618 MB have been dropped. 4238710 connection rejected. 413638 single hop client refused.

Looks like bandwidth info, connection info, and now DoS info, are since startup. Whereas circuit handshake info is since last heartbeat.

I would think that for DoS info, like circuit info, the thing I most want to know is "very recently, what happened"? So I personally would prefer the "since last time" data. But I can totally see this going either way.

Speaking of heartbeat, "40 marked address" doesn't tell me how many addresses are being rejected *right now*. In fact, this could be a single address that got marked 40 times since startup of my relay? (I guess not quite because I have 36 hours of uptime and there were 40 marked addresses, but it's close.)

comment:42 Changed 9 months ago by arma

+The following options are useful only for a public relay
[...]
+[[DoSRefuseSingleHopClientRendezvous]] **DoSRefuseSingleHopClientRendezvous** 

But it looks like the call to dos_should_refuse_single_hop_client() doesn't care whether public_server_mode().

I think that's fine, and the simple fix is to stop claiming that DoSRefuseSingleHopClientRendezvous only works on a public relay.

Version 0, edited 9 months ago by arma

comment:43 Changed 9 months ago by arma

+  if (dos_cc_get_defense_type(chan) == DOS_CC_DEFENSE_REFUSE_CELL) {
+    channel_send_destroy(cell->circ_id, chan,
+                         END_CIRC_REASON_TORPROTOCOL);

Should we use END_CIRC_REASON_RESOURCELIMIT instead?

I could see us using TORPROTOCOL but in that case we should add something to the tor specs describing what behavior is violating the protocol.

For comparison, we use END_CIRC_REASON_RESOURCELIMIT right now when our create cell queue is too full so we're preemptively failing the circuit.

comment:44 Changed 9 months ago by arma

get_circuit_rate_per_second() still isn't doing what I think you wanted.

Let's say dos_cc_circuit_rate_tenths is 19, i.e. 1.9 circuits per second.

Then get_circuit_rate_per_second() will return dos_cc_circuit_rate_tenths / 10 which is 1. Then later you'll compute num_token = elapsed_time_last_refill * circuit_rate and you'll be adding 1 circ per second to the token bucket.

I think if you want to keep the "tenths" notion, you need the rounding-down to happen when you're computing num_token, and not before. That is, you want to compute elapsed_time_since_refill * dos_cc_circuit_rate_tenths and then divide *that* by 10.

(It still won't be totally accurate, since whenever you get a create cell you'll call this refill function and discard the "fractional" circuit that you didn't add to the bucket. But I think that's ok.)

Edit: I am still totally fine dropping the "tenths" notion and just having circuit_rate be in whole numbers -- especially given that if we're getting at least 1 create cell each second, then we're going to be rounding down each second anyway, so the tenths notion only really kicks in when many seconds go by between create cells, and those aren't the scary situations.

Edit edit: I guess you could also make the buckets themselves be in tenths. And then subtract 10 from the bucket each time you handle a circuit. Then you could accumulate multiple fractions of a circuit over time, and the calculations over time would be more accurate. But I remain fine with the simpler whole number approach. :)
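
A tiny illustrative sketch of the two orders of operations described above (hypothetical names, not the real get_circuit_rate_per_second() or refill code):

#include <stdint.h>

/* Divide the tenths rate first: with rate_tenths = 19 (i.e. 1.9 circ/s),
 * 19 / 10 == 1, so the bucket refills at only 1 circuit per second. */
static uint64_t
tokens_rounded_too_early(uint64_t elapsed_sec, uint32_t rate_tenths)
{
  return elapsed_sec * (rate_tenths / 10);
}

/* Multiply first, divide last: for elapsed_sec = 10 and rate_tenths = 19
 * this yields 19 tokens instead of 10, only discarding the leftover
 * fraction at the very end. */
static uint64_t
tokens_rounded_at_the_end(uint64_t elapsed_sec, uint32_t rate_tenths)
{
  return (elapsed_sec * rate_tenths) / 10;
}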

Last edited 9 months ago by arma

comment:45 Changed 9 months ago by arma

  /* Number of allowed circuit rate that is this value is refilled at a rate
   * defined by the consensus plus a bit of random. It is decremented every
   * time a new circuit is seen for this client address and if the count goes
   * to 0, we have a positive detection. */
  uint32_t circuit_bucket;

(a) should be "Number of allocated circuits remaining for this address", i.e. it's not a rate, it's a size.

(b) What's this about "plus a bit of random"? :)

comment:46 Changed 9 months ago by arma

And thus ends my review. Looking good!

I've been trying to figure out if I would want to set the consensus params with these defaults -- "if 100 concurrent conns, ones after that are refused" and "90 circuits, refilled 3 per second" -- and I think yes I am comfortable with those.

In the future, I plan to advocate for merging dos_cc_new_create_cell() and dos_cc_get_defense_type() into a single function, which notes the existence of the new create cell and also tells us whether to apply a defense. And I plan to advocate for a second cc defense, which returns DOS_CC_DEFENSE_REFUSE_CELL simply when stats->cc_stats.circuit_bucket == 0, without any marking or checking of stats->concurrent_count. I think I will want to instrument a real relay to see how often it would trigger that new defense, though, and I am happy to delay my future plans so we can get this patch out the door.

comment:47 in reply to:  41 Changed 9 months ago by dgoulet

Replying to arma:

I would think that for DoS info, like circuit info, the thing I most want to know is "very recently, what happened"? So I personally would prefer the "since last time" data. But I can totally see this going either way.

I implemented that before, but then I switched because I wanted to have a big picture of the DoS, where the stats at every heartbeat give you an idea of the "right now" situation.

I do think both would be useful, tbh, because for instance the "marked address" count would climb to some number and then at some point be 0 all the time (because your tor has already marked all the addresses), which could be a bit confusing. It wouldn't be complicated to have both counts, a long-term one and a "since last heartbeat" one?

Speaking of heartbeat, "40 marked address" doesn't tell me how many addresses are being rejected *right now*. In fact, this could be a single address that got marked 40 times since startup of my relay? (I guess not quite because I have 36 hours of uptime and there were 40 marked addresses, but it's close.)

You can "double mark" an address only if it is marked once then removed from the geoip cache and then it comes back and marked again. In that case, the counter will do a ++ twice for the same address.

Once the marked_until_ts is set, it is never put back to 0 so it can't be counted twice unless the entry is removed from the geoip cache.
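
A hedged sketch of that marking rule, with made-up names rather than the actual dos.c symbols; it only illustrates why an address is normally counted once:

#include <time.h>

static unsigned num_marked_addrs_example;  /* heartbeat "marked address" count */

static void
mark_address_example(time_t *marked_until_ts, time_t now, time_t period)
{
  if (*marked_until_ts == 0) {
    *marked_until_ts = now + period;
    num_marked_addrs_example++;  /* counted once while the entry exists */
  }
  /* If the geoip cache entry is later evicted and re-created, the fresh
   * entry starts with marked_until_ts == 0, so a still-abusive address
   * can be marked (and counted) a second time. */
}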

comment:48 in reply to:  43 Changed 9 months ago by dgoulet

Reviewer: arma

First, I had to do a fixup unrelated to the review: 9d2699cad0

Replying to arma:

Should we use END_CIRC_REASON_RESOURCELIMIT instead?

Good point. Fixup commit 2d9b285f98.

But it looks like the call to dos_should_refuse_single_hop_client() doesn't care whether public_server_mode().

Agree. Fixup commit: ab7b9581f3

get_circuit_rate_per_second() still isn't doing what I think you wanted.

Yeah, so thanks to Hello71 on IRC, I realized that there was indeed an issue, and all your "Edit:" points were indeed discussed and proposed by Hello71 :).

I kind of think dropping the "tenths" would be good because, as you said, at best we can refill the bucket every second (approx_time()). So everything is rounded off to the second anyway. And with a considerable time gap between CREATE cells, the tenths make only a small difference, which probably won't matter at that point.

Thus, going to a number of circuits per second instead of "tenths" seems like the improvement to make.

What about this commit? 2106a5eaa8

(a) should be "Number of allocated circuits remaining for this address", i.e. it's not a rate, it's a size.

Fixup commit: c7099765b4

(b) What's this about "plus a bit of random"? :)

Don't know what you are talking about, I can't find this string in the code :S ...

In the future, I plan to advocate for merging dos_cc_new_create_cell() and dos_cc_get_defense_type() into a single function, which notes the existence of the new create cell and also tells us whether to apply a defense. And I plan to advocate for a second cc defense, which returns DOS_CC_DEFENSE_REFUSE_CELL simply when stats->cc_stats.circuit_bucket == 0, without any marking or checking of stats->concurrent_count. I think I will want to instrument a real relay to see how often it would trigger that new defense, though, and I am happy to delay my future plans so we can get this patch out the door.

Agree on both! Once we get this merged, let's open tickets for that!

Final fixup on the man page (thanks to Hello71!): 1b8835fccd

comment:49 Changed 9 months ago by dgoulet

Because we'll merge this into 033 before backporting, I'm happy to provide an 033 branch if needed; the merge has some conflicts (trivial, but some). Let me know.

comment:50 Changed 9 months ago by teor

Code review:

"dos.h" already exists in some Windows environments. We might want to pick another name.

The dos_*refuse_single_hop_client functions and variables are confusing. They only apply to v2 rendezvous, but they are named like they apply to all clients. Please add the word "rend".

Should we implement this defence for v3 rendezvous as well?
(We have a separate ticket for single hop HSDir and intro.)

comment:51 in reply to:  50 Changed 9 months ago by teor

Replying to teor:

...
The dos_*refuse_single_hop_client functions and variables are confusing. They only apply to v2 rendezvous, but they are named like they apply to all clients. Please add the word "rend".

Should we implement this defence for v3 rendezvous as well?
(We have a separate ticket for single hop HSDir and intro.)

Oops. Rend points are the same for v2 and v3. So the defence applies to both. Thanks arma!

comment:52 in reply to:  48 Changed 9 months ago by arma

Replying to dgoulet:

But it looks like the call to dos_should_refuse_single_hop_client() doesn't care whether public_server_mode().

Agree. Fixup commit: ab7b9581f3

(A) I think this one is missing a !.

(B) Yes, an 0.3.3 branch would be good so we have something to actually merge.

(C), it wants a changes file. Here's a start:

  o Major features:
    - Give relays some defenses against the recent network overload. We
      start with three defenses (default parameters in parentheses).
      First: if a single client address makes too many connections
      (>100), hang up on further connections. Second: if a single client
      address makes circuits too quickly (more than 3 per second, with
      an allowed burst of 90) while also having too many connections open
      (3), refuse new create cells for the next while (1-2 hours). Third:
      if a client asks to establish a rendezvous point to you directly,
      ignore the request. These defenses can be manually controlled
      by new torrc options, but relays will also take guidance from
      consensus parameters, so there's no need to configure anything
      manually. Implements ticket 24902.

Looking good!

comment:53 Changed 9 months ago by arma

I pushed a commit to my ticket24902_029_04-fixup branch that you might like -- it cleans up the heartbeat messages a bit.

comment:54 in reply to:  50 Changed 9 months ago by dgoulet

Status: needs_review → merge_ready

Replying to teor:

"dos.h" already exists in some Windows environments. We might want to pick another name.

We don't use it in the tor code base, so I doubt there could be any kind of confusion with the Windows dos.h?... Maybe if we were to use #include <dos.h>?

Otherwise, we could rename it to something like defense.c or mitigation.c or... dunno? I think we can do this after the upstream merge.

(A) I think this one is missing a !.

Indeed. Fixup: 12761ce3685c9d57

(C), it wants a changes file. Here's a start:

Very nice! I've added it in commit: ea5d3bf4d5188511

I pushed a commit to my ticket24902_029_04-fixup branch that you might like -- it cleans up the heartbeat messages a bit.

Very nice again, I've cherry-picked it in commit: 8c6833be7d6e631d

Ok, I think with teor and arma happy now, we can proceed with a merge_ready state. For this, I've created a _05 branch squashing the _04 branch fixup commits and created an OnionGit MR in case Nick wants to comment on it.

Branch: ticket24902_029_05
Oniongit: https://oniongit.eu/dgoulet/tor/merge_requests/20

Then we have an 0.3.3 branch as well (based on latest master) in case Nick wants to pick that one and not deal with the 029 merge into master.

Branch: ticket24902_033_02

comment:55 Changed 9 months ago by arma

Love it!

comment:56 Changed 9 months ago by nickm

I've merged the 24902_033_02 branch into master. If it doesn't explode in the next alpha, we can do a backport.

comment:57 Changed 9 months ago by teor

Status: merge_ready → needs_revision

See #25094 for fixes to clang errors, including better overflow checks in cc_stats_refill_bucket, and some more unit tests.

dgoulet, can you review my bug25094, and rebase it onto your ticket24902_029_05?

Also, I found another bug: when the previous time is zero, clients get infinite refills. This probably only affects shadow.

comment:58 in reply to:  57 Changed 9 months ago by arma

Replying to teor:

Also, I found another bug: when the previous time is zero, clients get infinite refills. This probably only affects shadow.

Shouldn't you get one refill at the beginning, and not after that? Surely shadow doesn't spend all of its time at midnight the morning of Jan 1 1970?

comment:59 Changed 9 months ago by teor

If the previous and current times are zero, you get infinite refills.
Your suggestion in #25094 to return early will fix both bugs.
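
A hedged sketch of that early-return idea (illustrative names, not the actual cc_stats_refill_bucket code): skip the refill when there is no usable previous timestamp or no time has elapsed, so a zero "last refill" time is never treated as a huge elapsed interval:

#include <stdint.h>
#include <time.h>

static void
refill_bucket_guarded_example(uint32_t *bucket, time_t *last_refill_ts,
                              time_t now, uint32_t rate, uint32_t burst)
{
  if (*last_refill_ts == 0 || now <= *last_refill_ts) {
    /* First call, or the clock did something odd: just record the time. */
    *last_refill_ts = now;
    return;
  }
  uint64_t added = (uint64_t)(now - *last_refill_ts) * rate;
  uint64_t total = *bucket + added;
  *bucket = total > burst ? burst : (uint32_t)total;
  *last_refill_ts = now;
}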

comment:60 Changed 9 months ago by dgoulet

Status: needs_revision → accepted

The fixes in #25094 have been merged and they are in the _029_05 branch.

Putting this back in the accepted state while waiting for the backport.

Last edited 9 months ago by dgoulet

comment:61 in reply to:  58 Changed 9 months ago by robgjansen

Replying to arma:

Replying to teor:

Also, I found another bug: when the previous time is zero, clients get infinite refills. This probably only affects shadow.

Shouldn't you get one refill at the beginning, and not after that? Surely shadow doesn't spend all of its time at midnight the morning of Jan 1 1970?

As of these commits (the latest one today), Shadow now tells Tor that the simulation starts at 946684800 seconds since the epoch, i.e., on January 1st, 2000.

comment:62 Changed 9 months ago by asn

Keywords: SponsorV added

comment:63 Changed 9 months ago by teor

dgoulet/ticket24902_029_05 is missing commit d2ae1bfcb3, which removes the redundant semicolon in test_dos.c

comment:64 in reply to:  63 Changed 9 months ago by dgoulet

Replying to teor:

dgoulet/ticket24902_029_05 is missing commit d2ae1bfcb3, which removes the redundant semicolon in test_dos.c

Oh didn't know about that one. Cherry-picked and pushed.

comment:65 Changed 9 months ago by dgoulet

Status: accepted → merge_ready

Ok, at this point, #25193 and #25183 are merged into this branch. They've both been ACKed by nickm.

This imo closes the loop. Testing is ongoing on the 033 branch, so the backport should be near.

comment:66 Changed 9 months ago by nickm

merged that to master again; still backportable.

comment:67 Changed 8 months ago by teor

We've been testing this patch across 16 relays over the weekend. (And we disabled all the statistics options, because at least two of them cause massive RAM bloat.)

RAM usage is down to about a gigabyte per relay.
(Previously, it was up to 10 GB per relay.)

On our largest guard, consensus weight 10x,xxx, we have the following heartbeat:

[notice] Heartbeat: Tor's uptime is x days xx:xx hours, with 25xxxx circuits open. I've sent 57xx.xx GB and received 57xx.xx GB.
[notice] Circuit handshake stats since last time: 25xxxxx/25xxxxx TAP, 27xxxxxx/27xxxxxx NTor.
[notice] Since startup, we have initiated x v1 connections, x v2 connections, x v3 connections, and 41xxx v4 connections; and received x v1 connections, 44xxx v2 connections, 58xxx v3 connections, and 50xxxx v4 connections.
[notice] DoS mitigation since startup: 56xxxx circuits rejected, 2x marked addresses. 0 connections closed. 24xx single hop clients refused.

I'm about to remove all our custom DoS mitigations, including the firewall. I'll report back in a day or two on how that goes.

comment:68 Changed 8 months ago by teor

Status: merge_ready → needs_revision

The man page doesn't document the defaults when there is no consensus parameter set:

/* DoSCircuitCreationEnabled default. Disabled by default. */
#define DOS_CC_ENABLED_DEFAULT 0
/* DoSCircuitCreationDefenseType maps to the dos_cc_defense_type_t enum. */
#define DOS_CC_DEFENSE_TYPE_DEFAULT DOS_CC_DEFENSE_REFUSE_CELL
/* DoSCircuitCreationMinConnections default */
#define DOS_CC_MIN_CONCURRENT_CONN_DEFAULT 3
/* DoSCircuitCreationRateTenths is 3 per seconds. */
#define DOS_CC_CIRCUIT_RATE_DEFAULT 3
/* DoSCircuitCreationBurst default. */
#define DOS_CC_CIRCUIT_BURST_DEFAULT 90
/* DoSCircuitCreationDefenseTimePeriod in seconds. */
#define DOS_CC_DEFENSE_TIME_PERIOD_DEFAULT (60 * 60)
...
/* DoSConnectionEnabled default. Disabled by default. */
#define DOS_CONN_ENABLED_DEFAULT 0
/* DoSConnectionMaxConcurrentCount default. */
#define DOS_CONN_MAX_CONCURRENT_COUNT_DEFAULT 100
/* DoSConnectionDefenseType maps to the dos_conn_defense_type_t enum. */
#define DOS_CONN_DEFENSE_TYPE_DEFAULT DOS_CONN_DEFENSE_CLOSE

And there is no #define for the default DoSRefuseSingleHopClientRendezvous setting in dos.h. Its default should be documented in the man page, even if we don't add a #define.

Also, the reference to DoSCircuitCreationRateTenths in the above code is outdated.

comment:69 Changed 8 months ago by teor

My relay radia4 became unmeasured shortly after I disabled my firewall and started relying on the DDoS defences. And then a few hours later, it was measured again.

I've checked that it's reachable on IPv4 and IPv6, and that the remaining firewall rules aren't blocking anything (unless the authorities are making *lots* of connections).

Could the authorities (or the bandwidth authority clients) be triggering one of the defences?
Aren't authorities meant to be exempted as relays?
Perhaps the bandwidth authority clients are building too many circuits?

(In particular, maatuska, bastet, moria1, and faravahar stopped voting for it.)

https://atlas.torproject.org/#details/C6B3546CC6BCCB649FEC82D348D464554BC6323D
https://consensus-health.torproject.org/consensus-health-2018-02-13-01-00.html#C6B3546CC6BCCB649FEC82D348D464554BC6323D

It's back up now, so it's not a big deal. But we should watch out for missing relay votes during 0.3.3.2-alpha.

Edit: it was re-measured, not down

Last edited 8 months ago by teor

comment:70 in reply to:  69 Changed 8 months ago by dgoulet

Replying to teor:

My relay radia4 became unmeasured shortly after I disabled my firewall and started relying on the DDoS defences. And then a few hours later, it was measured again.

I've checked that it's reachable on IPv4 and IPv6, and that the remaining firewall rules aren't blocking anything (unless the authorities are making *lots* of connections).

Could the authorities (or the bandwidth authority clients) be triggering one of the defences?
Aren't authorities meant to be exempted as relays?

For the reachability test, the authority opens a one-hop circuit to the relay and it is authenticated, right? But in any case, there is no defense applied to known relay IPs, and I assume the dirauths are well known.

Perhaps the bandwidth authority clients are building too many circuits?

If the bwauth is opening more than 3 concurrent connections and doing a 90-circuit burst on them at a rate of 3 circuits/second, then yes, that is *crazy* and would trigger the defense. Or if it is opening more than 100 TCP connections in parallel, any further connections would get refused.

Edit: it was re-measured, not down

The defense would be up for 60 minutes + rand(1, 30) minutes so if it was re-measured somehow properly without triggering the defense, I think that either the bwauth is on the edge there or it is not that.

If the bwauths aren't opening that many circuits, I would blame the network load and/or the bwauth code?

comment:71 Changed 8 months ago by dgoulet

For the man page comment, I propose we start opening tickets to fix those, because the ticket24902_029_05 branch is becoming a monster that must be kept in sync with 033. So I would avoid anything new in that branch until we backport it, and then we can apply the fixes on upstream 029?

comment:72 Changed 8 months ago by dgoulet

Keywords: tor-dos added; ddos removed
Status: needs_revision → merge_ready

For the man page fixes: #25236

Setting this back in merge_ready for the backport.

comment:73 Changed 8 months ago by teor

Another operator reports that their relay dropped out of the consensus for a few hours while running the new DoS code:
https://lists.torproject.org/pipermail/tor-relays/2018-February/014503.html

comment:74 Changed 8 months ago by Felixix

Not only with the new code: it was observed with 0.3.2.x even more often, with laxer firewall settings. That suggests the early conclusion that, in this case, 90/100 on 0.3.3.x acts similarly to 0.3.2.x, while 50/50 on 0.3.3.x does not show it.
(The numbers are DOS_CC_CIRCUIT_BURST_DEFAULT / DOS_CONN_MAX_CONCURRENT_COUNT_DEFAULT.)

Last edited 8 months ago by Felixix

comment:75 in reply to:  70 Changed 8 months ago by teor

Replying to dgoulet:

Replying to teor:

My relay radia4 became unmeasured shortly after I disabled my firewall and started relying on the DDoS defences. And then a few hours later, it was measured again.

I've checked that it's reachable on IPv4 and IPv6, and that the remaining firewall rules aren't blocking anything (unless the authorities are making *lots* of connections).

Could the authorities (or the bandwidth authority clients) be triggering one of the defences?
Aren't authorities meant to be exempted as relays?

For the reachability test, the authority opens a one-hop circuit to the relay and it is authenticated, right?

Yes.

But in any case, there is no defense applied to known relay IPs, and I assume the dirauths are well known.

It depends. If authorities set OutboundBindAddress, or their default route is through a non-public address, then their IPs won't be known. But they will be authenticated.

Perhaps the bandwidth authority clients are building too many circuits?

If the bwauth is opening more than 3 concurrent connections and doing a 90-circuit burst on them at a rate of 3 circuits/second, then yes, that is *crazy* and would trigger the defense. Or if it is opening more than 100 TCP connections in parallel, any further connections would get refused.

Edit: it was re-measured, not down

The defense would be up for 60 minutes + rand(1, 30) minutes so if it was re-measured somehow properly without triggering the defense, I think that either the bwauth is on the edge there or it is not that.

If the bwauths aren't opening that many circuits, I would blame the network load and/or the bwauth code?

Possibly. We should look for these issues in 0.3.3.2-alpha.

comment:76 Changed 8 months ago by teor

And here's an update from my relay logs. The defences seem to be working:

[notice] Heartbeat: Tor's uptime is 2 days x hours, with 11xxxx circuits open. I've sent 57xx.xx GB and received 56xx.xx GB.
[notice] Circuit handshake stats since last time: 12xxxxx/12xxxxx TAP, 22xxxxxx/22xxxxxx NTor.
[notice] Since startup, we have initiated x v1 connections, x v2 connections, x v3 connections, and 31xxx v4 connections; and received x v1 connections, 45xxx v2 connections, 70xxx v3 connections, and 80xxxx v4 connections.
[notice] DoS mitigation since startup: 68xxxxxx circuits rejected, 4xx marked addresses. 98xxxx connections closed. 18xx single hop clients refused.

comment:77 Changed 8 months ago by nickm

This branch is now merged into 0.2.9.x and forward.

comment:78 Changed 8 months ago by dgoulet

Resolution: fixed
Status: merge_ready → closed

Great success. Closing this for completion!

comment:79 Changed 7 months ago by hurus

Resolution: fixed
Status: closed → reopened

It is a crazy decision to merge those changes.

Some thoughts:

  1. Since the release of 0.2.9.15, 0.3.1.10 and 0.3.2.10, I myself as an extensive user of Tor, and my internal scripts which scan the Tor network and monitor onion sites, have started to see frequent unexpected broken requests.

It looks like this: you browse the onion site you want, and then suddenly, when you navigate or refresh the page, the connection immediately drops. Tor Browser shows an 'Unable to connect' message in this case.

It happens on around 1% of requests at the moment in my tests, and I suspect that number will be much higher once more relays update or the consensus changes.

I suspect such behaviour will make it impossible for the Tor network to keep its usage patterns. Users will not use a system which randomly drops connections. Before, Tor was very reliable: in years of usage I never noticed any dropped connection that was not handled by automatic reconnection, regardless of how loaded the relays were.

  2. A feature like dropping connections is very close to a 'shadow ban'. We can end up with the Tor network censoring the Tor network. The Tor network was made to fight censorship, but 'shadow ban' and 'connection drop' point in the other direction: towards censorship and selectivity about who is allowed to query the Tor network and who is not. This is not a path the Tor Project should follow.
  3. We may face a situation where relays drop all or most connections and a Tor client can never establish connections at more than some tiny minimal rate (e.g. 5 requests per second). Such a rate is not suitable for regular browsing. Even if we hope that all web properties are ideally optimized, there are still conditions which will lead to problems. Most web properties use 5-200 elements per page which are loaded simultaneously, which usually means 20 constant concurrent connections and 20-500 requests per second per client. With the usual limits, no one will correctly load all the elements of a web property.
  4. During testing there were signs that the word 'malicious' is used incorrectly. You are counting requests as 'malicious' but you have no idea whether they are malicious at all. You are likely blocking regular browsing of high-volume or less-optimized web properties while pretending that you are blocking something 'malicious'. I highly recommend removing that word, as it is incorrect and does not reflect reality.
  5. Is limiting per IPv4 address really the way to go in the modern world? We live in a world of IPv4 shortage, of NAT, and of potentially tens of clients per IPv4 address. By limiting per IPv4 address we only limit real users who for some reason send too many requests from their IPv4 address.

At the start of the century that was a reasonable approach. Nowadays it is a non-starter; such behaviour is suicidal for any online service.

  6. Is limiting per IPv6 address really the way to go in the modern world? An attacker can potentially use billions of IPv6 addresses, so limiting activity per IPv6 address will not impact DoS behaviour in any way; it will instead cause regular clients, which are not trying to bypass any mitigation, to be refused.
  7. Why do we limit usage patterns like crawling, or high-load projects which try to lower costs by using a single IP address? Are those not allowed usage patterns on Tor? I doubt it. The current set of patches forces any service that uses the Tor network extensively to build complicated, expensive systems simply to use the Tor network for anything high-volume.

All in all, I think these changes are horrible and should be reverted and reconsidered. In my opinion it is one of the most harmful commits in years of Tor development. It looks like this set of patches was made for a different world, not the one we live in; it is disconnected from reality. The only correct way to limit is by a combination of unique identifiers, for example: IP address + target destination + Tor node (requester) identifier (if any). It should never be the IP address alone, and it should never be forced rather than voluntary: it is not for you to decide whether a relay blocks requests, it is for the relay operators to decide, and for other Tor network participants to decide whether they will communicate through relays that apply such limits.

comment:80 Changed 7 months ago by cypherpunks

Resolution: fixed
Status: reopened → closed

@hurus my understanding is that the parameters will be relaxed as time goes on, since this is an emergency measure; the recent DDoS was very hard on the network. As a result, the DDoSers have now stopped and moved on.

In the meantime you can pick a guard node that is not running a Tor version with the recent DDoS mitigation. You can easily find them with the Relay Search on metrics.torproject.org.
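
(A hypothetical example of that workaround, using the standard EntryNodes torrc option; the fingerprint below is a made-up placeholder, not a real relay, and note that pinning a specific guard has its own anonymity trade-offs.)

EntryNodes $ABCDABCDABCDABCDABCDABCDABCDABCDABCDABCD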

comment:81 Changed 7 months ago by cypherpunks

FWIW, Firefox can open more than 100 connections at once, so it is a relatively small threshold.

comment:82 Changed 7 months ago by hurus

Replying to cypherpunks:

FWIW, Firefox can open more than 100 connections at once, so it is a relatively small threshold.

Even if we assume that Firefox will open fewer than 20 connections at once, the limits are damaging everything.

The number of end users is decreasing exponentially not because the DoS stopped, but because regular users stop using Tor when they cannot use it due to dumb limits. The decrease is larger in countries with more regular users and faster, more stable networks, because those users tend not to put up with slow or unstable networks; that is simply not acceptable in their environment. The USA, China, Hong Kong and most of Europe have fast and stable in-country networks, which shows up in the graphs as a massive decrease in users. Countries with unstable and/or slow networks, like Russia, Ukraine, African countries, Mexico, India, Pakistan, Iraq and Iran, show less of a decrease for the moment. But it will not last forever: users will tend to stop using Tor at all, and once users stop using it, relay / bridge / guard operators will have no reason to run Tor nodes and will stop them as well.

If it continues this way, Tor will need a fork.

comment:83 in reply to:  82 Changed 7 months ago by cypherpunks

Replying to hurus:

The number of end users is decreasing exponentially not because the DoS stopped, but because regular users stop using Tor as they cannot use it due to dumb limits.

This has no basis in reality; the DoS impact on regular users is zero (just read the default config).

comment:84 Changed 7 months ago by cypherpunks

This has no basis in reality; the DoS impact on regular users is zero (just read the default config).

/* DoSCircuitCreationMinConnections default */
#define DOS_CC_MIN_CONCURRENT_CONN_DEFAULT 3
/* DoSCircuitCreationRateTenths is 3 per seconds. */
#define DOS_CC_CIRCUIT_RATE_DEFAULT 3
/* DoSCircuitCreationBurst default. */
#define DOS_CC_CIRCUIT_BURST_DEFAULT 90
/* DoSCircuitCreationDefenseTimePeriod in seconds. */
#define DOS_CC_DEFENSE_TIME_PERIOD_DEFAULT (60 * 60)
DoSConnectionMaxConcurrentCount=50
/* DoSConnectionDefenseType maps to the dos_conn_defense_type_t enum. */
#define DOS_CONN_DEFENSE_TYPE_DEFAULT DOS_CONN_DEFENSE_CLOSE
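
For relay operators reading this: the values above are the built-in defaults, and they can be tuned with the corresponding torrc options (the subsystem can also pick up consensus parameters). A hypothetical torrc fragment follows; the values are purely illustrative, not recommendations.

DoSCircuitCreationEnabled 1
DoSCircuitCreationMinConnections 3
DoSCircuitCreationBurst 180
DoSCircuitCreationDefenseTimePeriod 3600
DoSConnectionEnabled 1
DoSConnectionMaxConcurrentCount 100
DoSRefuseSingleHopClientRendezvous 1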

https://metrics.torproject.org/relayflags.html
or
default_bridges.low_n

https://en.wikipedia.org/wiki/Network_address_translation

comment:85 in reply to:  79 Changed 7 months ago by asn

Replying to hurus:

It is a crazy decision to merge those changes.

Hello hurus. What would be your preferred response from the Tor team to the recent overload of the network? Just to be clear, we also don't like adding rate-limits and throttles to the network, but it seems like a necessary countermeasure for now given the extra clients that surged into the network. The performance and timeouts had already degraded before we added the DoS subsystem. What would you have done?

comment:86 in reply to:  81 Changed 7 months ago by arma

Replying to cypherpunks:

FWIW, Firefox can open more than 100 connections at once, so it is a relatively small threshold.

Whoever wrote this is quite confused. When Firefox makes 100 stream requests to the Tor client, the Tor client then bundles them over a single TLS connection to its guard. So that's one connection from Tor, no matter how many connections there are from Firefox.

This is part of why we avoid the word "connection" in most of our terminology, because it's so easy to misunderstand which layer of connection people are talking about.

Note: See TracTickets for help on using tickets.