Opened 3 years ago

Last modified 5 weeks ago

#15463 new defect

Tor deals poorly with a very large number of incoming connection requests.

Reported by: alberto
Owned by: yawning
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.2.5.11
Severity: Normal
Keywords: tor-hs, performance, dos
Cc:
Actual Points:
Parent ID: #24298
Points:
Reviewer:
Sponsor: SponsorR-can

Description

After starting tor, within a few minutes the CPU load reaches 100% and the hidden service becomes unavailable.
If I disable that specific hidden service and restart tor, everything returns to normal.
The problem is very similar to
https://lists.torproject.org/pipermail/tor-talk/2014-December/035807.html

The log contains many records like these:

Mar 26 10:57:48.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:8333. Giving up.
Mar 26 10:58:26.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $3EAAAB35932610411E24FA4317603CB5780B80BC~AccessNow002 at 176.10.99.201. Retrying on a new circuit.
Mar 26 10:58:42.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $379FB450010D17078B3766C2273303C358C3A442~aurora at 176.126.252.12. Retrying on a new circuit.
Mar 26 10:59:04.000 [notice] Closing stream for '[scrubbed].onion': hidden service is unavailable (try again later).
Mar 26 11:01:21.000 [notice] Tried for 130 seconds to get a connection to [scrubbed]:8333. Giving up.
Mar 26 11:02:05.000 [notice] Tried for 123 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:05.000 [notice] Tried for 123 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:05.000 [notice] Tried for 121 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:05.000 [notice] Tried for 129 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:05.000 [notice] Tried for 124 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:18.000 [notice] Tried for 131 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)

Or:
Mar 26 11:02:51.000 [notice] Your Guard torpidsUKuk2 ($C9933B3725239B6FAB5227BA33B30BE7B48BB485) is failing more circuits than usual. Most likely this means the Tor network is overloaded. Success counts are 116/171. Use counts are 48/49. 117 circuits completed, 1 were unusable, 0 collapsed, and 126 timed out. For reference, your timeout cutoff is 87 seconds.

Exactly the same situation as
https://lists.torproject.org/pipermail/tor-talk/2014-December/035833.html
which describes requests that "use little bandwidth, and seem to involve each request having a new rendezvous for each attempt, using lots of resources".


The problem exists in all versions (0.2.5, 0.2.6, master from git).

At the moment a few hidden services in the Tor network are being DDoSed by this method.

Child Tickets

Ticket  | Status | Owner | Summary                                                                            | Component
#11447  | closed |       | Find a better value for MAX_REND_FAILURES                                         | Core Tor/Tor
#13738  | new    |       | Make worker handle introduction point crypto                                      | Core Tor/Tor
#13739  | new    |       | Optimize the functions called in circuit_launch_by_extend_info()                  | Core Tor/Tor
#15515  | closed |       | Don't allow multiple INTRODUCE1s on the same circuit                              | Core Tor/Tor
#15540  | new    |       | Increase the capacity of a HS server by using bridges after we implement Prop 188 | Core Tor/Tor
#15544  | closed |       | Refuse INTRODUCE1 cell if circuit was created with CREATE_FAST                    | Core Tor/Tor
#17037  | closed |       | Too many introductions makes hidden service unreachable                           | Core Tor/Tor

Attachments (1)

debug.log.bz2 (2.4 MB) - added by alberto 3 years ago.
Debug log; it may help in understanding the situation.

Change History (36)

Changed 3 years ago by alberto

Attachment: debug.log.bz2 added

Debug log; it may help in understanding the situation.

comment:1 Changed 3 years ago by alberto

Please look at this problem; right now every hidden service in the network can be disabled by this method.

comment:2 Changed 3 years ago by alberto

Priority: major → critical
Summary: TOR CPU load 100%. Hidden service unavailable. Maybe zero-day vulnerability circuit storm. → TOR CPU load 100%. Hidden service unavailable. Maybe zero-day vulnerability like "circuit storm".

comment:3 Changed 3 years ago by dissapear

Confirmed!
I have the same situation with my hidden service. CPU load is 100% and the hidden service is unavailable. The log file is identical.

comment:4 Changed 3 years ago by asn

Hello, we took a brief look at this.

Some comments:

  • Our current hypothesis is that you are receiving a large amount of client traffic, which makes your already overloaded guard fail your rendezvous circuits. When Tor sees its rendezvous circuits failing, it aggressively relaunches them, which overloads the guard even more. Basically, your guard is gasping for oxygen and Tor chokes it further.

We need to look at whether we can make this relaunching logic less aggressive, or at least more conservative during busy times, to put less stress on the guard. We need some time to read the logs more and understand them better, but this seems to be an important issue that needs to be solved.
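
(For illustration only, a standalone sketch of what "less aggressive relaunching" could look like, i.e. an exponential backoff between relaunch attempts; every name and constant below is invented for this sketch and is not Tor's actual code.)

  /* Hypothetical sketch (not Tor code): back off on rendezvous-circuit
   * relaunches instead of retrying immediately, so that a struggling
   * guard is not hammered even harder. */
  #include <stdint.h>
  #include <time.h>

  #define BASE_RELAUNCH_DELAY_SEC  2
  #define MAX_RELAUNCH_DELAY_SEC  60

  /* How long to wait before the next relaunch attempt: the delay doubles
   * with every failure, up to a cap. */
  static uint32_t
  relaunch_delay_for_attempt(unsigned n_failures)
  {
    uint64_t delay = BASE_RELAUNCH_DELAY_SEC;
    while (n_failures-- > 0 && delay < MAX_RELAUNCH_DELAY_SEC)
      delay *= 2;
    return delay > MAX_RELAUNCH_DELAY_SEC ? MAX_RELAUNCH_DELAY_SEC
                                          : (uint32_t)delay;
  }

  /* Decide whether a failed rend circuit may be relaunched right now. */
  static int
  may_relaunch_now(time_t now, time_t last_failure_at, unsigned n_failures)
  {
    return now >= last_failure_at + relaunch_delay_for_attempt(n_failures);
  }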

  • If you want to answer: is this a very busy hidden service? Do you already expect large amounts of client activity, or could this be a DoS?
  • The logs you gave us did not contain the warn or notice severities. Is that because you are redirecting those severities somewhere else? Could you prepare a log file for us that reproduces this behavior and also includes the debug/info/notice/warn severities? Thanks!
  • If you feel experimental and want a short-term workaround, try setting CloseHSServiceRendCircuitsImmediatelyOnTimeout 1 in your torrc (a minimal torrc sketch follows this list). This might reduce the amount of relaunching, which might help. But it's an experimental option, so you might experience reachability issues. Please don't send us log files with this option enabled.
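
(For reference, a minimal torrc sketch with the experimental option from the last bullet; the HiddenServiceDir and HiddenServicePort values are placeholders for your own configuration.)

  # Experimental short-term mitigation suggested above; it may cause
  # reachability issues, and please don't send logs collected with it on.
  CloseHSServiceRendCircuitsImmediatelyOnTimeout 1

  # Hidden service definition shown only for context; the path and ports
  # below are placeholders.
  HiddenServiceDir /var/lib/tor/hidden_service/
  HiddenServicePort 80 127.0.0.1:8080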

comment:5 Changed 3 years ago by alberto

1) It's not a very busy hidden service: around 10-30 users online on the site, and the number of visitors has not increased recently.
During normal operation, when I enable debug logging, the log grows by maybe 5-10 MB in a few minutes. Now, with debug enabled, it grows to 250 MB in 2 minutes.
The server has never had problems like this before; Tor used 0.3-3% of the CPU.
2) I will try to produce the logs with the information you need within the next 1-2 hours.
3) I already tried setting CloseHSServiceRendCircuitsImmediatelyOnTimeout 1 in torrc. No change in CPU load (100% within a few seconds), and the hidden service does not work.
The attack targets this specific service. When it is disabled, the other hidden services work without problems and Tor does not overload the CPU.
But while the attacked service is running, practically all the other hidden services are down too.

comment:6 Changed 3 years ago by asn

(You might also want to idle on IRC in #tor-dev on the OFTC network.
Many Tor developers are there, and we might be able to help more synchronously.)

comment:7 Changed 3 years ago by alberto

I am collecting the logs for you now.
After that I will try to join IRC.

But the attack has stopped for now (for the last day the hidden service was fully disabled; I tried enabling the HS just now and everything works normally, Tor does not load the CPU). But this hell could resume at any time.
I have an old log with 'notice' severities and will attach it to the ticket as soon as possible.

Thank you for your attention to this problem.

comment:8 Changed 3 years ago by alberto

The attack has resumed :(

I collected a full debug log with all severities for you.
I started the daemon, waited for tor to overload the CPU (about 1-1.5 minutes), waited a little longer to collect data, and then shut the daemon down.

The file is too big for the bug tracker; please download it at
https://www.sendspace.com/file/y15neg

Now I will try to connect to IRC.

comment:9 Changed 3 years ago by yawning

Keywords: tor-hs added
Milestone: Tor: 0.2.7.x-final
Summary: TOR CPU load 100%. Hidden service unavailable. Maybe zero-day vulnerability like "circuit storm". → Tor deals poorly with a very large number of incoming connection requests.
Version: Tor: 0.2.5.11

Changing the title to something that better describes the symptoms of the issue. As far as I can tell, this isn't anything new: Tor has always had a hard time dealing with an extreme number of HS clients, and this appears to be an extreme illustration of a known issue (in particular #8902).

Note: I'm not ruling out a crazy client-side bug here that causes clients to retry in a tight loop, new intro point and all, but even if that's the case, the server-side problems still exist and need to be addressed.

So time to discuss potential solutions/mitigation ideas.

Yes, these would be clear improvements:

  • Solve #11447. 8 is more than likely too high a retry count, especially when under load.
  • Implement the performance fixes suggested in #8902.
  • Prop 224. Faster public key crypto should help things here.

Maybe an improvement:

  • Consider rejecting single-hop client-to-RP circuits. Probably not a great idea, since Tor2Web uses this functionality for performance reasons, and people who want to mount the "pro" version of this attack can just run their own RP, so there's not much to gain here.
  • Give serious thought to being more aggressive about dropping INTRODUCE2 cells when under load (a standalone sketch appears at the end of this comment). While this will not prevent a HS from being extremely hard to reach, the current behavior is clearly not optimal here.

No, not an option:

  • Cycle the IP under load. Fetching the new IP is trivial, there is no gain here, and there are a lot of really scary anonymity impacts. (This would be the opposite of #8239, which would be bad.)
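
(To illustrate the INTRODUCE2-dropping idea in the "maybe" list above, here is a standalone token-bucket sketch that drops excess introductions instead of launching a rendezvous circuit for every one of them. All names and constants are invented and the numbers are arbitrary; this is not Tor's actual code.)

  /* Rate-limit INTRODUCE2 processing with a token bucket: when the
   * bucket is empty the service is overloaded and the cell is dropped. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <time.h>

  #define INTRO2_TOKENS_PER_SEC  50   /* assumed sustainable rate */
  #define INTRO2_BUCKET_MAX     200   /* allow short bursts */

  typedef struct intro2_bucket_t {
    uint32_t tokens;
    time_t last_refill;
  } intro2_bucket_t;

  /* Refill tokens for the elapsed time, then decide whether to process
   * this INTRODUCE2 (true) or drop it (false). */
  static bool
  intro2_bucket_allow(intro2_bucket_t *b, time_t now)
  {
    if (now > b->last_refill) {
      uint64_t add = (uint64_t)(now - b->last_refill) * INTRO2_TOKENS_PER_SEC;
      b->tokens = (add + b->tokens > INTRO2_BUCKET_MAX) ?
                    INTRO2_BUCKET_MAX : (uint32_t)(b->tokens + add);
      b->last_refill = now;
    }
    if (b->tokens == 0)
      return false;   /* Overloaded: drop this introduction. */
    b->tokens--;
    return true;
  }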

comment:10 Changed 3 years ago by nickm

Keywords: SponsorR SponsorZ added

comment:11 Changed 3 years ago by asn

More questions for future log digging:

  • Does the attacker send multiple INTRODUCE1 cells on a single circuit, so that he can basically cause N rendezvous with a single circuit? We should look at whether this is possible and maybe fix it (a minimal sketch of such a fix follows this list).
  • In the log of comment:8, we received about 14k INTRODUCE2 cells, which means that we tried to establish that many rendezvous circuits. We also relaunched many rendezvous circuits because of failures. Two important questions here: why did those rend circuits fail; was it because of the guard? And how many circuit relaunches happened in total?
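
(A minimal sketch of the fix hinted at in the first bullet: remember whether a circuit has already delivered an INTRODUCE1 and refuse any further ones, so that a single circuit cannot trigger N rendezvous attempts. The type and function names are invented for illustration and do not reflect Tor's internals.)

  /* Track per-circuit whether an INTRODUCE1 was already seen. */
  #include <stdbool.h>

  typedef struct or_circuit_sketch_t {
    bool already_received_introduce1;
    /* ... other per-circuit state ... */
  } or_circuit_sketch_t;

  /* Return true if the cell should be processed, false if it should be
   * dropped (and the circuit torn down by the caller). */
  static bool
  accept_introduce1(or_circuit_sketch_t *circ)
  {
    if (circ->already_received_introduce1)
      return false;              /* Second INTRODUCE1 on this circuit. */
    circ->already_received_introduce1 = true;
    return true;
  }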

comment:12 Changed 3 years ago by asn

Another question:

  • If the cause of the failing circuits is the guard, did we enable this attack when we switched to a single guard node?

comment:13 Changed 3 years ago by yawning

So we got profiler output and more logs and other things, and found out a bunch of stuff:

  • Reducing MAX_REND_FAILURES to 1 doesn't appear to help much, if at all.
  • The profiler results match dgoulet's #13739 results fairly closely, with smartlist_remove consuming more CPU because of the extra calls caused by the additional rend_service_relaunch_rendezvous invocations. The largest consumer of CPU is Curve25519, so things like #8897/#9663 will help, as will offloading the ntor handshake onto the worker.

We still need mitigation (probably in the form of dropping INTRODUCE2 cells at the HS, INTRODUCE1 cells at the IP, or both), since no matter how much faster we make tor, it's easier for the adversary to increase the malicious traffic.

Last edited 3 years ago by yawning

comment:14 Changed 3 years ago by asn

Some long-term solutions that have been proposed:

  • Do active queue management on introductions (#15516) (a toy sketch follows this list)
    • Use WRED queue logic, CoDel, or Stochastic Fair Blue
  • Apply filtering on Introduction Points
    • IPs could rate limit clients / ask for Proof-of-Work / etc.
  • Load balancing traffic through Intro Points ("scaling" ideas)
  • Load balancing traffic through bridges
    • Proposal 188 / s7r
  • Move to the I2P model of static long-term inbound gateways
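
(A toy sketch of the active-queue-management bullet above: as the backlog of pending introductions grows, drop new ones with increasing probability, in the spirit of RED/CoDel/Blue. Thresholds and names are invented; this is not Tor's code.)

  /* Probabilistic early drop based on the current queue length. */
  #include <stdlib.h>

  #define INTRO_QUEUE_MIN_THRESH  64    /* start dropping above this */
  #define INTRO_QUEUE_MAX_THRESH 512    /* drop everything above this */

  /* Return nonzero if an incoming introduction should be dropped, given
   * the number of queued, not-yet-served introductions. */
  static int
  should_drop_introduction(size_t queue_len)
  {
    if (queue_len <= INTRO_QUEUE_MIN_THRESH)
      return 0;
    if (queue_len >= INTRO_QUEUE_MAX_THRESH)
      return 1;
    /* Drop probability rises linearly between the two thresholds. */
    double p = (double)(queue_len - INTRO_QUEUE_MIN_THRESH) /
               (double)(INTRO_QUEUE_MAX_THRESH - INTRO_QUEUE_MIN_THRESH);
    return ((double)rand() / RAND_MAX) < p;
  }
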
Last edited 3 years ago by asn

comment:15 Changed 3 years ago by asn

I managed to reproduce this and do a few tests with short-term solutions:

  • Decreasing MAX_REND_FAILURES didn't really help.
  • Completely disabling relaunches and killing the rend circuit on the first timeout did not help either.
  • Hard-coding the second hop (with my sticky_mids branch) in an attempt to reduce path-selection CPU time did not really help either.

Another thought: can we figure out whether such a volume of INTRODUCE1 cells is possible without #15515? If the attacker is not using #15515, and the IP can handle that many circuits, why can't our hidden service also handle them? If the attacker is using #15515, we should really fix it.
A small piece of information that might point towards #15515: in the first logs, the HS had 3 IPs. The first IP sent us 11k INTRODUCE2 cells, the second 3.5k, and the last only 200. Similarly, in the last logs the first IP sent 6k INTRODUCE2 cells, the second 3k, and the last about 50. What I'm trying to say here, friends, is that the distribution is not the uniform one you would expect from normal clients, and the two distributions are quite similar.

comment:16 in reply to:  15 ; Changed 3 years ago by arma

Replying to asn:

Can we figure out whether such volume of INTRODUCE1 cells is possible without #15515? If the attacker is not using #15515, and the IP can handle that many circuits, why can't our hidden service also handle them?
[...] on the first logs, the HS had 3 IPs. The first IP sent us 11k INTRODUCE2 cells, the second 3.5k INTRODUCE2 cells, and the last only 200. Similarly, on the last logs the first IP sent 6k INTRODUCE2 cells, the second 3k INTRODUCE2 cells and the last about 50. What I'm trying to say here friends is that the distribution is not uniform as would be expected by a normal client, and also the two distributions are quite similar.

Another explanation (alas) might be that each of the two main intro points here had a different capacity for handling incoming requests, so each got saturated at a different level.

comment:17 in reply to:  14 Changed 3 years ago by s7r

Replying to asn:

Some long-term solutions that have been proposed:

  • Do active queue management on introductions (#15516)
    • Use WRED queue logic or CoDeL or Stochastic Fair Blue
  • Apply filtering on Introduction Points
    • IPs could rate limit clients / ask for Proof-of-Work / etc.
  • Load balancing traffic through Intro Points ("scaling" ideas)
  • Load balancing traffic through bridges
    • Proposal 188 / s7r

Added #15540 as a child ticket with a short description.

comment:18 in reply to:  16 Changed 3 years ago by dgoulet

Replying to arma:

Replying to asn:

Can we figure out whether such volume of INTRODUCE1 cells is possible without #15515? If the attacker is not using #15515, and the IP can handle that many circuits, why can't our hidden service also handle them?
[...] on the first logs, the HS had 3 IPs. The first IP sent us 11k INTRODUCE2 cells, the second 3.5k INTRODUCE2 cells, and the last only 200. Similarly, on the last logs the first IP sent 6k INTRODUCE2 cells, the second 3k INTRODUCE2 cells and the last about 50. What I'm trying to say here friends is that the distribution is not uniform as would be expected by a normal client, and also the two distributions are quite similar.

Another explanation (alas) might be that each of the main two intro points here had different capacity to handle incoming requests, so they each got saturated at a different level.

I doubt that's the case, because there is an ordering: we see a sequential progression over time, that is, 11k from IP1, *then* 3.5k from IP2, *and then* 200 from IP3. There is a small overlap between the IPs, but they are all ordered in time.

If IP capacity were the issue, I think we would have seen more overlap between the IPs, rather than this clean cut in time on *both* attacks (in the two different logs).

comment:19 Changed 3 years ago by dgoulet

Keywords: SponsorU added; SponsorZ removed

comment:20 Changed 3 years ago by dgoulet

Priority: critical → major

comment:21 Changed 2 years ago by nickm

Keywords: TorCoreTeam201507 added

comment:22 Changed 2 years ago by nickm

Owner: set to yawning
Status: new → assigned

comment:23 Changed 2 years ago by nickm

Keywords: TorCoreTeam201508 added; TorCoreTeam201507 removed

comment:24 Changed 2 years ago by nickm

Keywords: TorCoreTeam201509 added; TorCoreTeam201508 removed

comment:25 Changed 2 years ago by nickm

Milestone: Tor: 0.2.7.x-final → Tor: 0.2.8.x-final

comment:26 Changed 2 years ago by nickm

Keywords: SponsorU removed
Sponsor: SponsorU

Bulk-replace SponsorU keyword with SponsorU field.

comment:27 Changed 2 years ago by dgoulet

Keywords: SponsorR removed
Sponsor: SponsorU → SponsorR

comment:28 Changed 2 years ago by dgoulet

Keywords: TorCoreTeam201509 removed
Milestone: Tor: 0.2.8.x-final → Tor: 0.2.???
Priority: major → normal

comment:29 Changed 21 months ago by dgoulet

Sponsor: SponsorR → SponsorR-can

Move those from SponsorR to SponsorR-can.

comment:30 Changed 13 months ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:31 Changed 12 months ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:32 Changed 7 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:33 Changed 7 months ago by dgoulet

Keywords: performance added
Severity: Normal
Sponsor: SponsorR-can
Status: assigned → new

comment:34 Changed 6 months ago by nickm

Keywords: dos added
Sponsor: SponsorR-can

comment:35 Changed 5 weeks ago by asn

Parent ID: #24298