Opened 3 years ago

Last modified 5 weeks ago

#15463 new defect

Tor deals poorly with a very large number of incoming connection requests.

Reported by: alberto
Owned by: yawning
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version: Tor: 0.2.5.11
Severity: Normal
Keywords: tor-hs, performance, dos
Cc:
Actual Points:
Parent ID: #24298
Points:
Reviewer:
Sponsor: SponsorR-can

Description

After starting tor, within a few minutes the CPU load reaches 100% and the hidden service becomes unavailable.
If I disable that specific hidden service and restart tor, everything returns to normal.
The problem is very similar to
https://lists.torproject.org/pipermail/tor-talk/2014-December/035807.html

The log contains many records like these:

Mar 26 10:57:48.000 [notice] Tried for 120 seconds to get a connection to [scrubbed]:8333. Giving up.
Mar 26 10:58:26.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $3EAAAB35932610411E24FA4317603CB5780B80BC~AccessNow002 at 176.10.99.201. Retrying on a new circuit.
Mar 26 10:58:42.000 [notice] We tried for 15 seconds to connect to '[scrubbed]' using exit $379FB450010D17078B3766C2273303C358C3A442~aurora at 176.126.252.12. Retrying on a new circuit.
Mar 26 10:59:04.000 [notice] Closing stream for '[scrubbed].onion': hidden service is unavailable (try again later).
Mar 26 11:01:21.000 [notice] Tried for 130 seconds to get a connection to [scrubbed]:8333. Giving up.
Mar 26 11:02:05.000 [notice] Tried for 123 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:05.000 [notice] Tried for 123 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:05.000 [notice] Tried for 121 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:05.000 [notice] Tried for 129 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:05.000 [notice] Tried for 124 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)
Mar 26 11:02:18.000 [notice] Tried for 131 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)

Or:
Mar 26 11:02:51.000 [notice] Your Guard torpidsUKuk2 ($C9933B3725239B6FAB5227BA33B30BE7B48BB485) is failing more circuits than usual. Most likely this means the Tor network is overloaded. Success counts are 116/171. Use counts are 48/49. 117 circuits completed, 1 were unusable, 0 collapsed, and 126 timed out. For reference, your timeout cutoff is 87 seconds.

Exactly the same situation as
https://lists.torproject.org/pipermail/tor-talk/2014-December/035833.html
which describes requests that "use little bandwidth, and seem to involve each request having a new rendezvous for each attempt, using lots of resources".


The problem exists in all versions (0.2.5, 0.2.6, master from git).

At the moment a few hidden services in the Tor network are being DDoSed by this method.

Child Tickets

Ticket  | Status | Owner | Summary                                                                            | Component
#11447  | closed |       | Find a better value for MAX_REND_FAILURES                                         | Core Tor/Tor
#13738  | new    |       | Make worker handle introduction point crypto                                      | Core Tor/Tor
#13739  | new    |       | Optimize the functions called in circuit_launch_by_extend_info()                  | Core Tor/Tor
#15515  | closed |       | Don't allow multiple INTRODUCE1s on the same circuit                              | Core Tor/Tor
#15540  | new    |       | Increase the capacity of a HS server by using bridges after we implement Prop 188 | Core Tor/Tor
#15544  | closed |       | Refuse INTRODUCE1 cell if circuit was created with CREATE_FAST                    | Core Tor/Tor
#17037  | closed |       | Too many introductions makes hidden service unreachable                           | Core Tor/Tor

Attachments (1)

debug.log.bz2 (2.4 MB) - added by alberto 3 years ago.
Debug log; it may help in understanding the situation.

Change History (36)

Changed 3 years ago by alberto

Attachment: debug.log.bz2 added

Debug log; it may help in understanding the situation.

comment:1 Changed 3 years ago by alberto

Please look at this problem; right now every hidden service in the network can be disabled by this method.

comment:2 Changed 3 years ago by alberto

Priority: major → critical
Summary: TOR CPU load 100%. Hidden service unavailable. Maybe zero-day vulnerability circuit storm. → TOR CPU load 100%. Hidden service unavailable. Maybe zero-day vulnerability like "circuit storm".

comment:3 Changed 3 years ago by dissapear

Confirmed!
I have the same situation with my hidden service. CPU load is 100% and the hidden service is unavailable. The log file is identical.

comment:4 Changed 3 years ago by asn

Hello, we took a brief look at this.

Some comments:

  • Our current hypothesis is that you are receiving a large amount of client traffic, which makes your already overloaded guard fail your rendezvous circuits. When Tor sees its rendezvous circuits failing, it aggressively relaunches them, which overloads the guard even more. Basically, your guard is gasping for oxygen and Tor chokes it further.

We need to look at whether we can make this relaunching logic less aggressive, or at least more conservative during busy times, to put less stress on the guard. We need some time to read the logs more and understand them better, but this seems to be an important issue that needs to be solved.
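
(For illustration only, a standalone sketch of what "less aggressive relaunching" could look like, i.e. an exponential backoff between relaunch attempts; every name and constant below is invented for this sketch and is not Tor's actual code.)

  /* Hypothetical sketch (not Tor code): back off on rendezvous-circuit
   * relaunches instead of retrying immediately, so that a struggling
   * guard is not hammered even harder. */
  #include <stdint.h>
  #include <time.h>

  #define BASE_RELAUNCH_DELAY_SEC  2
  #define MAX_RELAUNCH_DELAY_SEC  60

  /* How long to wait before the next relaunch attempt: the delay doubles
   * with every failure, up to a cap. */
  static uint32_t
  relaunch_delay_for_attempt(unsigned n_failures)
  {
    uint64_t delay = BASE_RELAUNCH_DELAY_SEC;
    while (n_failures-- > 0 && delay < MAX_RELAUNCH_DELAY_SEC)
      delay *= 2;
    return delay > MAX_RELAUNCH_DELAY_SEC ? MAX_RELAUNCH_DELAY_SEC
                                          : (uint32_t)delay;
  }

  /* Decide whether a failed rend circuit may be relaunched right now. */
  static int
  may_relaunch_now(time_t now, time_t last_failure_at, unsigned n_failures)
  {
    return now >= last_failure_at + relaunch_delay_for_attempt(n_failures);
  }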

  • If you want to answer: is this a very busy hidden service? Do you already expect large amounts of client activity, or could this be a DoS?
  • The logs you gave us did not contain the warn or notice severities. Is that because you are redirecting those severities somewhere else? Could you prepare a log file for us that reproduces this behavior and also includes the debug/info/notice/warn severities? Thanks!
  • If you feel experimental and want a short-term workaround, try setting CloseHSServiceRendCircuitsImmediatelyOnTimeout 1 in your torrc (a minimal torrc sketch follows this list). This might reduce the amount of relaunching, which might help. But it's an experimental option, so you might experience reachability issues. Please don't send us log files with this option enabled.
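
(For reference, a minimal torrc sketch with the experimental option from the last bullet; the HiddenServiceDir and HiddenServicePort values are placeholders for your own configuration.)

  # Experimental short-term mitigation suggested above; it may cause
  # reachability issues, and please don't send logs collected with it on.
  CloseHSServiceRendCircuitsImmediatelyOnTimeout 1

  # Hidden service definition shown only for context; the path and ports
  # below are placeholders.
  HiddenServiceDir /var/lib/tor/hidden_service/
  HiddenServicePort 80 127.0.0.1:8080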

comment:5 Changed 3 years ago by alberto

1) It's not a very busy hidden service: around 10-30 users online on the site, and the number of visitors has not increased recently.
During normal operation, when I enable debug logging, the log grows by maybe 5-10 MB in a few minutes. Now, with debug enabled, it grows to 250 MB in 2 minutes.
The server has never had problems like this before; Tor used 0.3-3% of the CPU.
2) I will try to produce the logs with the information you need within the next 1-2 hours.
3) I already tried setting CloseHSServiceRendCircuitsImmediatelyOnTimeout 1 in torrc. No change in CPU load (100% within a few seconds), and the hidden service does not work.
The attack targets this specific service. When it is disabled, the other hidden services work without problems and Tor does not overload the CPU.
But while the attacked service is running, practically all the other hidden services are down too.

comment:6 Changed 3 years ago by asn

(You might also want to idle on IRC in #tor-dev on the OFTC network.
Many Tor developers are there, and we might be able to help more synchronously.)

comment:7 Changed 3 years ago by alberto

I am collecting the logs for you now.
After that I will try to join IRC.

But the attack has stopped for now (for the last day the hidden service was fully disabled; I tried enabling the HS just now and everything works normally, Tor does not load the CPU). But this hell could resume at any time.
I have an old log with 'notice' severities and will attach it to the ticket as soon as possible.

Thank you for your attention to this problem.

comment:8 Changed 3 years ago by alberto

The attack has resumed :(

I collected a full debug log with all severities for you.
I started the daemon, waited for tor to overload the CPU (about 1-1.5 minutes), waited a little longer to collect data, and then shut the daemon down.

The file is too big for the bug tracker; please download it at
https://www.sendspace.com/file/y15neg

Now I will try to connect to IRC.

comment:9 Changed 3 years ago by yawning

Keywords: tor-hs added
Milestone: Tor: 0.2.7.x-final
Summary: TOR CPU load 100%. Hidden service unavailable. Maybe zero-day vulnerability like "circuit storm". → Tor deals poorly with a very large number of incoming connection requests.
Version: Tor: 0.2.5.11

Changing the title to something that better describes the symptoms of the issue. As far as I can tell, this isn't anything new: Tor has always had a hard time dealing with an extreme number of HS clients, and this appears to be an extreme illustration of a known issue (in particular #8902).

Note: I'm not ruling out a crazy client-side bug here that causes clients to retry in a tight loop, new intro point and all, but even if that's the case, the server-side problems still exist and need to be addressed.

So time to discuss potential solutions/mitigation ideas.

Yes, these would be clear improvements:

  • Solve #11447. 8 is more than likely too high a retry count, especially when under load.
  • Implement the performance fixes suggested in #8902.
  • Prop 224. Faster public key crypto should help things here.

Maybe an improvement:

  • Consider rejecting single-hop client-to-RP circuits. Probably not a great idea, since Tor2Web uses this functionality for performance reasons, and people who want to mount the "pro" version of this attack can just run their own RP, so there's not much to gain here.
  • Give serious thought to being more aggressive about dropping INTRODUCE2 cells when under load (a standalone sketch appears at the end of this comment). While this will not prevent a HS from being extremely hard to reach, the current behavior is clearly not optimal here.

No, not an option:

  • Cycle the IP under load. Fetching the new IP is trivial, there is no gain here, and there are a lot of really scary anonymity impacts. (This would be the opposite of #8239, which would be bad.)
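
(To illustrate the INTRODUCE2-dropping idea in the "maybe" list above, here is a standalone token-bucket sketch that drops excess introductions instead of launching a rendezvous circuit for every one of them. All names and constants are invented and the numbers are arbitrary; this is not Tor's actual code.)

  /* Rate-limit INTRODUCE2 processing with a token bucket: when the
   * bucket is empty the service is overloaded and the cell is dropped. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <time.h>

  #define INTRO2_TOKENS_PER_SEC  50   /* assumed sustainable rate */
  #define INTRO2_BUCKET_MAX     200   /* allow short bursts */

  typedef struct intro2_bucket_t {
    uint32_t tokens;
    time_t last_refill;
  } intro2_bucket_t;

  /* Refill tokens for the elapsed time, then decide whether to process
   * this INTRODUCE2 (true) or drop it (false). */
  static bool
  intro2_bucket_allow(intro2_bucket_t *b, time_t now)
  {
    if (now > b->last_refill) {
      uint64_t add = (uint64_t)(now - b->last_refill) * INTRO2_TOKENS_PER_SEC;
      b->tokens = (add + b->tokens > INTRO2_BUCKET_MAX) ?
                    INTRO2_BUCKET_MAX : (uint32_t)(b->tokens + add);
      b->last_refill = now;
    }
    if (b->tokens == 0)
      return false;   /* Overloaded: drop this introduction. */
    b->tokens--;
    return true;
  }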

comment:10 Changed 3 years ago by nickm

Keywords: SponsorR SponsorZ added

comment:11 Changed 3 years ago by asn

More questions for future log digging:

  • Does the attacker send multiple INTRODUCE1 cells on a single circuit, so that he can basically cause N rendezvous with a single circuit? We should look at whether this is possible and maybe fix it (a minimal sketch of such a fix follows this list).
  • In the log of comment:8, we received about 14k INTRODUCE2 cells, which means that we tried to establish that many rendezvous circuits. We also relaunched many rendezvous circuits because of failures. Two important questions here: why did those rend circuits fail; was it because of the guard? And how many circuit relaunches happened in total?
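
(A minimal sketch of the fix hinted at in the first bullet: remember whether a circuit has already delivered an INTRODUCE1 and refuse any further ones, so that a single circuit cannot trigger N rendezvous attempts. The type and function names are invented for illustration and do not reflect Tor's internals.)

  /* Track per-circuit whether an INTRODUCE1 was already seen. */
  #include <stdbool.h>

  typedef struct or_circuit_sketch_t {
    bool already_received_introduce1;
    /* ... other per-circuit state ... */
  } or_circuit_sketch_t;

  /* Return true if the cell should be processed, false if it should be
   * dropped (and the circuit torn down by the caller). */
  static bool
  accept_introduce1(or_circuit_sketch_t *circ)
  {
    if (circ->already_received_introduce1)
      return false;              /* Second INTRODUCE1 on this circuit. */
    circ->already_received_introduce1 = true;
    return true;
  }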

comment:12 Changed 3 years ago by asn

Another question:

  • If the cause of the failing circuits is the guard, did we enable this attack when we switched to a single guard node?

comment:13 Changed 3 years ago by yawning

So we got profiler output and more logs and other things, and found out a bunch of stuff:

  • Reducing MAX_REND_FAILURES to 1 doesn't appear to help much, if at all.
  • The profiler results match dgoulet's #13739 results fairly closely, with smartlist_remove consuming more CPU because of the extra calls caused by the additional rend_service_relaunch_rendezvous invocations. The largest consumer of CPU is Curve25519, so things like #8897/#9663 will help, as will offloading the ntor handshake onto the worker.

We still need mitigation (probably in the form of dropping INTRODUCE2 cells at the HS, INTRODUCE1 cells at the IP, or both), since no matter how much faster we make tor, it's easier for the adversary to increase the malicious traffic.

Last edited 3 years ago by yawning

comment:14 Changed 3 years ago by asn

Some long-term solutions that have been proposed:

  • Do active queue management on introductions (#15516) (a toy sketch follows this list)
    • Use WRED queue logic, CoDel, or Stochastic Fair Blue
  • Apply filtering on Introduction Points
    • IPs could rate limit clients / ask for Proof-of-Work / etc.
  • Load balancing traffic through Intro Points ("scaling" ideas)
  • Load balancing traffic through bridges
    • Proposal 188 / s7r
  • Move to the I2P model of static long-term inbound gateways
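
(A toy sketch of the active-queue-management bullet above: as the backlog of pending introductions grows, drop new ones with increasing probability, in the spirit of RED/CoDel/Blue. Thresholds and names are invented; this is not Tor's code.)

  /* Probabilistic early drop based on the current queue length. */
  #include <stdlib.h>

  #define INTRO_QUEUE_MIN_THRESH  64    /* start dropping above this */
  #define INTRO_QUEUE_MAX_THRESH 512    /* drop everything above this */

  /* Return nonzero if an incoming introduction should be dropped, given
   * the number of queued, not-yet-served introductions. */
  static int
  should_drop_introduction(size_t queue_len)
  {
    if (queue_len <= INTRO_QUEUE_MIN_THRESH)
      return 0;
    if (queue_len >= INTRO_QUEUE_MAX_THRESH)
      return 1;
    /* Drop probability rises linearly between the two thresholds. */
    double p = (double)(queue_len - INTRO_QUEUE_MIN_THRESH) /
               (double)(INTRO_QUEUE_MAX_THRESH - INTRO_QUEUE_MIN_THRESH);
    return ((double)rand() / RAND_MAX) < p;
  }
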
Last edited 3 years ago by asn

comment:15 Changed 3 years ago by asn

I managed to reproduce this and do a few tests with short-term solutions:

  • Decreasing MAX_REND_FAILURES didn't really help.
  • Completely disabling relaunches and killing the rend circuit on the first timeout did not help either.
  • Hard-coding the second hop (with my sticky_mids branch) in an attempt to reduce path-selection CPU time did not really help either.

Another thought: can we figure out whether such a volume of INTRODUCE1 cells is possible without #15515? If the attacker is not using #15515, and the IP can handle that many circuits, why can't our hidden service also handle them? If the attacker is using #15515, we should really fix it.
A small piece of information that might point towards #15515: in the first logs, the HS had 3 IPs. The first IP sent us 11k INTRODUCE2 cells, the second 3.5k, and the last only 200. Similarly, in the last logs the first IP sent 6k INTRODUCE2 cells, the second 3k, and the last about 50. What I'm trying to say here, friends, is that the distribution is not the uniform one you would expect from normal clients, and the two distributions are quite similar.

comment:16 in reply to:  15 ; Changed 3 years ago by arma

Replying to asn:

Can we figure out whether such volume of INTRODUCE1 cells is possible without #15515? If the attacker is not using #15515, and the IP can handle that many circuits, why can't our hidden service also handle them?
[...] on the first logs, the HS had 3 IPs. The first IP sent us 11k INTRODUCE2 cells, the second 3.5k INTRODUCE2 cells, and the last only 200. Similarly, on the last logs the first IP sent 6k INTRODUCE2 cells, the second 3k INTRODUCE2 cells and the last about 50. What I'm trying to say here friends is that the distribution is not uniform as would be expected by a normal client, and also the two distributions are quite similar.

Another explanation (alas) might be that each of the two main intro points here had a different capacity for handling incoming requests, so each got saturated at a different level.

comment:17 in reply to:  14 Changed 3 years ago by s7r

Replying to asn:

Some long-term solutions that have been proposed:

  • Do active queue management on introductions (#15516)
    • Use WRED queue logic or CoDeL or Stochastic Fair Blue
  • Apply filtering on Introduction Points
    • IPs could rate limit clients / ask for Proof-of-Work / etc.
  • Load balancing traffic through Intro Points ("scaling" ideas)
  • Load balancing traffic through bridges
    • Proposal 188 / s7r

Added #15540 as a child ticket with a short description.

comment:18 in reply to:  16 Changed 3 years ago by dgoulet

Replying to arma:

Replying to asn:

Can we figure out whether such volume of INTRODUCE1 cells is possible without #15515? If the attacker is not using #15515, and the IP can handle that many circuits, why can't our hidden service also handle them?
[...] on the first logs, the HS had 3 IPs. The first IP sent us 11k INTRODUCE2 cells, the second 3.5k INTRODUCE2 cells, and the last only 200. Similarly, on the last logs the first IP sent 6k INTRODUCE2 cells, the second 3k INTRODUCE2 cells and the last about 50. What I'm trying to say here friends is that the distribution is not uniform as would be expected by a normal client, and also the two distributions are quite similar.

Another explanation (alas) might be that each of the main two intro points here had different capacity to handle incoming requests, so they each got saturated at a different level.

I doubt that's the case, because there is an ordering: we see a sequential progression over time, that is, 11k from IP1, *then* 3.5k from IP2, *and then* 200 from IP3. There is a small overlap between the IPs, but they are all ordered in time.

If IP capacity were the issue, I think we would have seen more overlap between the IPs, rather than this clean cut in time on *both* attacks (in the two different logs).

comment:19 Changed 3 years ago by dgoulet

Keywords: SponsorU added; SponsorZ removed

comment:20 Changed 3 years ago by dgoulet

Priority: critical → major

comment:21 Changed 2 years ago by nickm

Keywords: TorCoreTeam201507 added

comment:22 Changed 2 years ago by nickm

Owner: set to yawning
Status: new → assigned

comment:23 Changed 2 years ago by nickm

Keywords: TorCoreTeam201508 added; TorCoreTeam201507 removed

comment:24 Changed 2 years ago by nickm

Keywords: TorCoreTeam201509 added; TorCoreTeam201508 removed

comment:25 Changed 2 years ago by nickm

Milestone: Tor: 0.2.7.x-final → Tor: 0.2.8.x-final

comment:26 Changed 2 years ago by nickm

Keywords: SponsorU removed
Sponsor: SponsorU

Bulk-replace SponsorU keyword with SponsorU field.

comment:27 Changed 2 years ago by dgoulet

Keywords: SponsorR removed
Sponsor: SponsorU → SponsorR

comment:28 Changed 2 years ago by dgoulet

Keywords: TorCoreTeam201509 removed
Milestone: Tor: 0.2.8.x-final → Tor: 0.2.???
Priority: major → normal

comment:29 Changed 21 months ago by dgoulet

Sponsor: SponsorR → SponsorR-can

Move those from SponsorR to SponsorR-can.

comment:30 Changed 13 months ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:31 Changed 12 months ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:32 Changed 7 months ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:33 Changed 7 months ago by dgoulet

Keywords: performance added
Severity: Normal
Sponsor: SponsorR-can
Status: assigned → new

comment:34 Changed 6 months ago by nickm

Keywords: dos added
Sponsor: SponsorR-can

comment:35 Changed 5 weeks ago by asn

Parent ID: #24298