This ticket is for adding a denial of service mitigation subsystem to tor.
This is motivated by the latest issues we've been having on the network, with 1 million users most likely causing the huge loads we've been seeing on the relays. The subsystem provides a framework for adding defenses to tor against potential denial of service, whether voluntary or not.
First, there are no unit tests yet because I wanted more opinions on the design, engineering and overall structure of the code before writing them.
Second, this code has been running on my relay for ~4 days, during which more than 330 IPs have been identified as malicious and are having their cells dropped (which is the defense in place).
Third, there could still be an issue with client traffic going through an Exit and back into the network; we need to address this, or at least mitigate it as much as we can, before we deploy.
Fourth, this feature is disabled by default and I would expect that in normal circumstances, it won't be used at all. I see this as a way to help out in situations like the one we are in right now.
Last thing, there is another possible mitigation for the high number of concurrent TCP connections doing tor2web. We are seeing that at a high rate on the network right now (most likely scanning the "DarkWeb"). This branch is NOT about that, but a detection/defense for it could take advantage of this code in many ways.
This seems like it may also heavily stress or kill off relays with old Tor versions: when the DDoSers change their guard (due to this patch), the load eventually settles on some relay with an old Tor version.
As I suggested privately, I believe the best defense against tor traffic via an exit is to count unauthenticated (client, bridge, onion service) and authenticated (public relay) connections separately.
> This seems like it may also heavily stress or kill off relays with old Tor versions: when the DDoSers change their guard (due to this patch), the load eventually settles on some relay with an old Tor version.
Yes, that is one of my worries. However, this circuit creation mitigation silently drops cells on a created circuit. In other words, clients open circuits on the Guard and the Guard returns CREATED in response, so the client thinks the circuit is valid and sends a bunch of cells that the Guard then silently drops.
I believe this makes the client not switch Guard and just keep sending stuff into the void. So the big Guard will soak up the load instead of spreading it out.
Not perfect but a first step towards better defense.
> As I suggested privately, I believe the best defense against tor traffic via an exit is to count unauthenticated (client, bridge, onion service) and authenticated (public relay) connections separately.
Yes indeed, that part is missing. I'm not entirely sure why we should track connections independently here; this DoS mitigation only tracks client connections.
So basically, I think we could do this as an extra "Exit detection" protection: check whether it is a known digest, and maybe also check whether we have a matching non-client channel for the address. What do you think?
It is probably ok to ignore relay connections for now, but if we ever get a DDoS via relays or by unpublished relays, we will be sorry and wish we had counted relays and clients separately.
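To make the "count them separately" idea concrete, here is a minimal sketch, assuming a per-address record with one counter per peer kind. None of these names (`peer_kind_t`, `addr_conn_counts_t`, `addr_client_conns_over_limit`) come from the branch; they are hypothetical and only illustrate that relay connections never count against the client limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification: "authenticated" means the peer proved a
 * public relay identity; clients, bridges and onion services are not. */
typedef enum {
  PEER_UNAUTHENTICATED,
  PEER_AUTHENTICATED_RELAY
} peer_kind_t;

/* Hypothetical per-address record with one counter per peer kind. */
typedef struct {
  uint32_t client_conns;
  uint32_t relay_conns;
} addr_conn_counts_t;

/* Record a newly opened connection from this address. */
void
addr_note_conn_open(addr_conn_counts_t *c, peer_kind_t kind)
{
  assert(c);
  if (kind == PEER_AUTHENTICATED_RELAY)
    c->relay_conns++;
  else
    c->client_conns++;
}

/* Record a closed connection; never underflows. */
void
addr_note_conn_close(addr_conn_counts_t *c, peer_kind_t kind)
{
  assert(c);
  uint32_t *counter = (kind == PEER_AUTHENTICATED_RELAY) ?
                      &c->relay_conns : &c->client_conns;
  if (*counter > 0)
    (*counter)--;
}

/* Only unauthenticated connections are compared against the limit. */
bool
addr_client_conns_over_limit(const addr_conn_counts_t *c,
                             uint32_t max_client_conns)
{
  assert(c);
  return c->client_conns > max_client_conns;
}
```

With this split, a later relay-side defense could apply its own limit to `relay_conns` without touching the client path.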
I've added the relay detection that was discussed with teor. Also, after a discussion with asn on IRC, I've simplified things a bit: instead of the circuit delta and circuit timeout values, we now use a fixed maximum circuit count, which can be computed ahead of time and set in the consensus rather than computed at the relay. As per that discussion, I've also added a cap on the number of concurrent connections used in the circuit threshold equation. See top commit.
Oh dear -- does this really have to be 0.3.3 only? Can this rebase cleanly back on to an older branch, so that we know how hard the backport will be?
Hmmm, I hadn't thought of this as something to backport, nor how far back we would want to backport it.
Very little of it touches the current code base and most of it is in its own file, so we should be able to backport this properly, but I believe it will require a different branch for each backported version.
Additionally, I'm going to start reviewing the branch on oniongit too
> Very little of it touches the current code base and most of it is in its own file, so we should be able to backport this properly, but I believe it will require a different branch for each backported version.
Actually, it might not. Often we can just write one branch against maint-0.2.9, and merge it forward into all the other branches.
OK I did a basic review of the code and the design.
I think the current code complexity stems from the slot/bucket design, and splitting the time periods into slots, marking them, and assessing the circuits based on slots. I think without the slot system the logic could be as simple as:
-> for every new circuit of this IP, nr_of_circuits++;
-> for every new conn of this IP, nr_of_conns++;
-> every N seconds, reset nr_of_circuits for this IP.
if (nr_of_conns > conn_magic_number || nr_of_circuits > circ_magic_number) {
  return DROP;
}
return GOOD;
I understand that the slot design can eventually allow us to block even attackers with a single connection while allowing normal clients to do circuit bursts, but I'm questioning whether the complexity is worth it. Furthermore, it's possible that the slot system can be exploited by attackers who go all out during some 30 second slots and stay more chill for the rest of them, and so still get a pass for the entire time period.
If we kill the slot system, we will get a very simple system but it will be less versatile, and we will need to have a bigger magic_number to be able to keep our false positives at a reasonable rate.
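The slot-free logic above could look roughly like the following minimal sketch. The threshold values and all names (`ip_stats_t`, `CIRC_MAGIC_NUMBER`, and so on) are hypothetical placeholders for the "magic numbers" in the pseudocode, not values taken from the branch, and time is passed in as plain seconds to keep the example self-contained.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical thresholds; the review only calls them "magic numbers". */
#define CIRC_MAGIC_NUMBER 90
#define CONN_MAGIC_NUMBER 50
#define RESET_INTERVAL_SEC 30 /* the "N seconds" of the pseudocode */

typedef enum { VERDICT_GOOD, VERDICT_DROP } verdict_t;

/* Hypothetical per-IP counters. */
typedef struct {
  uint32_t nr_of_circuits; /* circuits seen since the last reset */
  uint32_t nr_of_conns;    /* connections seen from this IP */
  uint64_t last_reset_sec; /* when nr_of_circuits was last zeroed */
} ip_stats_t;

/* "for every new conn of this IP, nr_of_conns++" */
void
ip_stats_note_new_conn(ip_stats_t *s)
{
  assert(s);
  s->nr_of_conns++;
}

/* "every N seconds, reset nr_of_circuits for this IP" */
void
ip_stats_maybe_reset(ip_stats_t *s, uint64_t now_sec)
{
  assert(s);
  if (now_sec >= s->last_reset_sec &&
      now_sec - s->last_reset_sec >= RESET_INTERVAL_SEC) {
    s->nr_of_circuits = 0;
    s->last_reset_sec = now_sec;
  }
}

/* "for every new circuit of this IP, nr_of_circuits++", then decide. */
verdict_t
ip_stats_note_new_circuit(ip_stats_t *s, uint64_t now_sec)
{
  assert(s);
  ip_stats_maybe_reset(s, now_sec);
  s->nr_of_circuits++;
  if (s->nr_of_conns > CONN_MAGIC_NUMBER ||
      s->nr_of_circuits > CIRC_MAGIC_NUMBER)
    return VERDICT_DROP;
  return VERDICT_GOOD;
}
```

The trade-off described above shows up directly: with a single counter per reset window, the only way to tolerate legitimate circuit bursts is to raise `CIRC_MAGIC_NUMBER`.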
Moving this back to "accepted" since a lot will change after IRC discussions. The new and hopefully simpler design is this now:
Have a per-IP circuit token bucket which is refilled with some value at some rate, both defined by consensus parameters. Remove a token from the bucket every time a CREATE is seen. If the bucket goes down to 0, activate the defense if the number of concurrent connections is above a certain threshold defined by a consensus parameter.
Detect a high number of connections per IP and start closing connections for that IP once it reaches a threshold specified by a consensus parameter.
Add a torrc option and/or consensus parameter to refuse client connections doing ESTABLISH_RENDEZVOUS, in other words an anti-tor2web option at the relay. These have been observed to be quite problematic, as people are running hundreds (if not thousands) of tor2web clients scanning the onion space. As collateral damage, this loads relays with connections for rendezvous circuits. We could easily combine that option with a threshold on parallel connections, like "if I see 10 conn on that IP doing RDV, block".
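As an illustration of the first item, here is a minimal sketch of a per-IP circuit token bucket. It assumes a refill rate in tokens per second, a burst size that caps the bucket, and a minimum concurrent connection count; these three would be the consensus parameters, but their names (`circuit_rate`, `circuit_burst`, `min_conns`) and every function below are hypothetical and not taken from the branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the consensus parameters. */
typedef struct {
  uint32_t circuit_rate;  /* tokens added per second */
  uint32_t circuit_burst; /* bucket capacity */
  uint32_t min_conns;     /* concurrent connection threshold */
} cc_params_t;

/* Hypothetical per-IP state. */
typedef struct {
  uint32_t tokens;
  uint32_t concurrent_conns;
  uint64_t last_refill_sec;
  bool marked; /* defense active for this address */
} cc_client_t;

void
cc_client_init(cc_client_t *c, const cc_params_t *p, uint64_t now_sec)
{
  assert(c && p);
  c->tokens = p->circuit_burst; /* start with a full bucket */
  c->concurrent_conns = 0;
  c->last_refill_sec = now_sec;
  c->marked = false;
}

/* Add rate * elapsed tokens, capped at the burst size, without overflow. */
void
cc_refill(cc_client_t *c, const cc_params_t *p, uint64_t now_sec)
{
  assert(c && p);
  if (now_sec <= c->last_refill_sec)
    return;
  uint64_t elapsed = now_sec - c->last_refill_sec;
  c->last_refill_sec = now_sec;
  if (p->circuit_rate == 0 || c->tokens >= p->circuit_burst)
    return;
  uint32_t room = p->circuit_burst - c->tokens;
  if (elapsed > room / p->circuit_rate)
    c->tokens = p->circuit_burst;
  else
    c->tokens += (uint32_t)(elapsed * p->circuit_rate);
}

/* Called for every CREATE seen from this IP; returns true if the
 * address is (now) marked and the defense should apply. */
bool
cc_note_create(cc_client_t *c, const cc_params_t *p, uint64_t now_sec)
{
  cc_refill(c, p, now_sec);
  if (c->tokens > 0)
    c->tokens--;
  if (c->tokens == 0 && c->concurrent_conns > p->min_conns)
    c->marked = true;
  return c->marked;
}
```

In this sketch an empty bucket alone does not trigger the defense: a client with few concurrent connections that bursts through its tokens stays unmarked, which is the role the connection threshold plays in the design above.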
Ok, some code is ready implementing the above. It differs greatly from the previous branch; most of it has been simplified. This branch also now implements the 3 detections mentioned above, which are (1) circuit creation DoS, (2) concurrent connection DoS and (3) ESTABLISH_RENDEZVOUS from client (tor2web) DoS.
Some stuff you might find useful to know before you dive in:
Each detection (listed above) has for now only a single type of defense implemented, so if we think of more that we might want in the short term, now is a good time to get them in. Defenses can be selected by a consensus parameter.
For the (3) defense, I've gone quite explicitly with a torrc option (also controlled by a consensus param) to refuse tor2web client connections. There is no threshold, no nothing for now, because frankly I think tor2web clients are hurting us more than anything else through their ability to connect directly to all relays and thus "naturally" put resource pressure on all of them...
This code uses the geoip client cache, which seems fine but has an interesting quirk. After 24h, an entry is wiped out, which means we lose all the DoS mitigation statistics for that entry at that point. Not too bad, because if the client address is still DoSing, it will be detected again and blocked. It wouldn't be too hard to avoid that, but it would require a bit more code/thinking to clean entries up so the cache doesn't grow infinitely.
The branch you are about to review is based on 029, considering a possible backport. If you want to test this on a relay that was previously >= 0.3.1, know that your client won't connect to it until you get a new consensus, because it will be expecting an ed25519 key and none is used in 029. Either use an older client or merge forward to latest master by resolving the few conflicts there will be :).
All the DoS detections and defenses are disabled by default; a consensus param must be set to enable them. So, if you want to test this on a relay, go into src/or/dos.h and change the enabled default values in there for both the CC and CONN defenses.
A log has been added to the heartbeat so if it works, you should see such a line (real one!):
Jan 21 16:32:57.647 [notice] DoS mitigation since startup: 459085 cells rejected, 128 marked address. 235 MB have been dropped. 20126 connection rejected. 200 tor2web client refused.
Let us bikeshed on names here if you don't like them and please propose an alternative because this is currently the best I can come up with my "non-English-been-a-month-on-DoS-land brain" :).
I think we should add two more Tor2web defences managed by a consensus parameter:
when an introduce cell is sent direct from a client, drop that cell and any extend requests
this is really important because it delays Tor2web introductions and failed introduction extends
drop HSDir lookups where the circuit came directly from a client
I think we should wait a release or two to turn the introduce and HSDir ones on.
But if it gets really bad, and we backport them to 0.2.9, maybe we can turn them on sooner.
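Taken together with the existing ESTABLISH_RENDEZVOUS refusal, the two proposed defences amount to a small decision table keyed on the request type and on whether the circuit came directly from a client. A minimal sketch, assuming one on/off switch per defence (which would map to consensus parameters); all type and function names here are hypothetical and not part of tor:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request kinds relevant to Tor2web traffic. */
typedef enum {
  HS_REQ_ESTABLISH_RENDEZVOUS,
  HS_REQ_INTRODUCE,
  HS_REQ_HSDIR_LOOKUP,
  HS_REQ_OTHER
} hs_request_t;

/* Hypothetical switches, one per defence. */
typedef struct {
  bool refuse_rendezvous; /* existing defence in the branch */
  bool drop_introduce;    /* proposed: drop direct introduce cells */
  bool drop_hsdir_lookup; /* proposed: drop direct HSDir lookups */
} tor2web_defenses_t;

/* Return true if the request should be dropped. Requests that did not
 * arrive directly from a client are never affected. */
bool
tor2web_should_drop(const tor2web_defenses_t *d, hs_request_t req,
                    bool direct_from_client)
{
  assert(d);
  if (!direct_from_client)
    return false;
  switch (req) {
    case HS_REQ_ESTABLISH_RENDEZVOUS:
      return d->refuse_rendezvous;
    case HS_REQ_INTRODUCE:
      return d->drop_introduce;
    case HS_REQ_HSDIR_LOOKUP:
      return d->drop_hsdir_lookup;
    case HS_REQ_OTHER:
      return false;
  }
  return false;
}
```

Keeping the switches independent matches the suggestion of turning the introduce and HSDir defences on later than the rendezvous one.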
I also think that Tor2web combined with single onion services makes a DDoS much more likely.
Neither end has any guards, and they both make single hop connections.
And we're not defending against that at all right now.
When the service side is a directly connected client (single onion service):
we should automatically activate the introduce defence
this is very effective, because it stops Tor2web straight away
we should automatically activate the rendezvous defence (drop all cells) as soon as the service connects
this is not very effective, because the rendezvous has established, but it's important for security
I've gone over Roger's review in the oniongit. Some discussions are left to be answered.
asn will soon hand off to me a unittests branch (very awesome) so expect that at some point, I'll take over and put it in as an extra commit.
> I think we should add two more Tor2web defenses managed by a consensus parameter:
Thanks teor for this, I 100% agree with you. What I'm wondering is whether we should take the time to implement and backport these as well, or for now only put in the RP one (which I think is the worst one, because clients open the RP before doing the introduction) and put the others in 034+? If the latter, I propose we open a new ticket for this "anti DoS + tor2web" issue, because at that point, if we end up with relays just denying direct client connections for HS purposes, we should strongly consider ripping the tor2web code out of Tor. I won't start a "why do that" discussion in this ticket.
Do we know if the extra load is bringing down HSDirs?
(The fetch creates more load, but HS descriptors are cached by clients.)
Let's open separate tickets for 0.3.4 for blocking Tor2web HSDir and Intro. And we should think about backporting the HSDir defence, because we will want it if the load gets worse.
We might also want to block single onion / Tor2web intros and rendezvous by default, and backport the code for security. The existing tickets are #22688 and #22689.