Opened 13 months ago

Last modified 11 days ago

#26294 assigned defect

attacker can force intro point rotation by ddos

Reported by: arma Owned by: asn
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, tor-dos, network-team-roadmap-2019-Q1Q2
Cc: asn Actual Points:
Parent ID: #29999 Points: 7
Reviewer: Sponsor: Sponsor27-must

Description

Currently, an onion service's intro points each expire (intentionally rotate) after receiving rand(16384, 16384*2) intro requests.

Imagine an attacker who generates many introduction attempts. Since each intro attempt can take its own path to the target intro point, the bottleneck will be the introduction circuit itself. Let's say that intro circuit can sustain 500KBytes/s of traffic. That's 1000 intro requests per second coming in -- so after 24ish seconds (rand(16,32)), that intro point will expire: the onion service will pick a new one and start publishing new onion descriptors.

If the intro circuit can handle 1MByte/s, then rotation will happen after 12ish seconds.

With three intro circuits, each receiving intro requests at a different rate, we could end up changing our descriptor even more often than this. There are at least four impacts from this attack:

(1) Onion services spend energy and bandwidth generating new intro circuits, and publishing new descriptors to list them.

(2) Clients might get the last onion descriptor, not the next one, and so they'll attempt to introduce to a circuit that's no longer listening.

(3) The intro points themselves get a surprise 16k-32k incoming circuits, probably plus a lot more after that because the attacker wouldn't know when to stop. Not only that, but for v2 onion services these circuits use the slower TAP as the circuit handshake at the intro point.

(4) The HSDirs get a new descriptor every few seconds, which aside from the bandwidth and circuit load, tells them that the onion service is under attack like this.

Intro points that can handle several megabytes of traffic per second will keep up and push the intro requests back to the onion service, thus hastening the rotation. Intro points that *can't* handle that traffic will become congested and no fun to use for others during the period of the attack.

The reason we rotate after 16k-32k requests is because the intro point keeps a replay cache, to avoid ever responding to a given intro request more than once.

One direction would be to work on bumping up the size of the replay cache, or designing a different data structure like a bloom filter so we can scale the replay cache better. I think we could succeed there. The benefits would be to (1) and (2) and (4) above, i.e. onion services won't spend so much time making new descriptors, and clients will be more likely to use an onion descriptor that's still accurate. The drawback would be to (3), where the hotspots last longer, that is, the poor intro point feels the damage for a longer period of time.

Taking a step back, I think there are two directions we can go here. Option one, we can try to scale to handle the load. We would focus on load balancing better, like reacting to an attack by choosing super fast intro points, and either choosing fast middle nodes too, or some fancier approach like having multiple circuits to your intro point. Option two, we recognize that this volume of introduction requests represents a problem in itself, and try to introduce defenses at the intro point level. Here we focus on proof of work schemes or other ways to slow down the flow, or we improve the protocol to pass along hints about how to sort the intro requests by priority.

Child Tickets

Change History (12)

comment:1 Changed 13 months ago by asn

Cc: asn added

Agreed this is an interesting and useful problem to work on.

comment:2 Changed 13 months ago by dgoulet

Keywords: tor-hs tor-dos added
Milestone: Tor: unspecified

comment:3 Changed 3 months ago by pili

Sponsor: Sponsor27

comment:4 Changed 3 months ago by asn

Sponsor: Sponsor27Sponsor27-must

comment:5 Changed 3 months ago by asn

Points: 7

There are some easy stuff we can do here. Assigning 7 points to do the easy stuff and think about future stuff, if we don't get to fix this completely.

comment:6 Changed 3 months ago by pili

Parent ID: #29999

comment:7 Changed 2 months ago by gaba

Keywords: network-team-roadmap-2019-Q1Q2 added

Add keyword to tickets in network team's roadmap.

comment:8 Changed 8 weeks ago by asn

Back when that ticket was filed, I also had the chance to meet with some onion service experts and independently discuss this issue. Here are some unpublished notes:

We decided that allowing this attack because of the replay cache is a red herring. Specifically, the replay cache is not that big with only 16k-32k requests so we could indeed grow it. Furthermore, we could also clear the cache after X requests and start with a new one; that would allow the attacker to replay each introduction once, but that's fine because making new intro requests is not *that heavy* anyway, and it's definitely better than allowing them to rotate our intro points non-stop.

Also it's important to realize that the replay cache is held on the HS-side and not on the intropoint-side. I just verified this in our codebase, because I was also confused about this! The HS keeps two (!) replay caches for each INTRODUCE2 cell: one is per-intropoint (v3: replay_cache / v2: accepted_intro_rsa_parts) and the other is per-HS and (v3: replay_cache_rend_cookie / v2: accepted_intro_dh_parts).

I think what we should do here is:

a) Short-term: Reevaluate our replay detection strategy and see whether it's indeed too heavy. Evaluate whether we need both caches. Evalute the size of our replay caches given X requests. Evaluate whether we can clear our replay caches after Y requests and just keep on using the same intro points.

c) Medium-term: Consider more high-level directions to handle big load, like proof of work, path selection, or other intro protocol.s

comment:9 Changed 8 weeks ago by asn

Owner: set to asn
Status: newassigned

comment:10 Changed 6 weeks ago by arma

I like 'a' as a short term plan.

comment:11 Changed 5 weeks ago by s7r

I like 'a' as a short term plan as well. Proof of work solutions are non trivial engineering challenges, consume time and it eventually still gets down to the simple question how much resources/work/time/bandwidth is the attacker willing to give to pull this of.

what if we add a time based lifetime for each intro point, which will be a random value chosen at intro point selection between n and m hours, along with a ALLOW_RESET_CACHE parameter which will be a random number between o and p and also maintain the intro requests lifetime rand(16384, 16384*2) which will be combined with ALLOW_RESET_CACHE, and rebuild descriptor when the first from these two is reached. This way we don't have to increase the cache but only reset it.

For example:
An onion service selects Z as intro point. It also chooses these random values and remembers them for this intro point:

  • time based lifetime = 5 hours (let's pretend n = 1; m = 6)
  • ALLOW_RESET_CACHE = 1400 (let's pretend ALLOW_RESET_CACHE = rand(100, 7000))
  • intro requests lifetime = 20122 (rand(16384, 16384*2)

Now, this intro point will be rotated either after 5 hours, if the onion service is not under attack, either after 20122 * 1400 = 28,170,800 intro requests.

If high values would have been chose for ALLOW_RESET_CACHE and intro requests lifetime, indeed we will be getting many introduction requests through the same introduction point, but we still have the time based lifetime parameter as a safety precaution that will eventually move us from this introduction point.

We can go even go more crazy about this and use the introduction point measured bandwidth or consensus weight so we choose parameters based on how much the intro point is actually able to support in terms of bandwidth, so we don't end up with maintaining an introduction point that is hammered and can't process the requests because it's too slow. Or find another way to check if the intro point is actually responding to intro requests. But even without these smarter computations the presented solution still has to be better than what we have now.

All 3 parameters must be randomized as described, otherwise we open the door for easier analysis and predictability for attackers, like estimate with high probability when will the intro point change occur, etc. (outside the scope of this ticket).

The numbers for time based lifetime and ALLOW_RESET_CACHE don't have any analysis behind, they are just from top of my head and only to illustrate and example about the logic we need to code. We need to evaluate and choose good parameters for these values, if we think this is a good idea.

comment:12 Changed 11 days ago by cypherpunks

My concern about a proof of work approach is it appears to open a back channel where a hidden service operator has influence over client behaviour. This could result in clients executing possibly rarely used/exploitable codepaths, or new correlation attacks. For example, the hidden service operator sets a requirement for a PoW that takes 1.21 KW to compute. The operator has also hacked in to an energy company with high resolution "smart" meters, then could sit back and watch as users login to the service.

Note: See TracTickets for help on using tickets.