Opened 15 months ago

Last modified 6 days ago

#26294 merge_ready defect

attacker can force intro point rotation by ddos

Reported by: arma Owned by: asn
Priority: Medium Milestone: Tor: 0.4.2.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-hs, tor-dos, network-team-roadmap-august
Cc: asn Actual Points: 6
Parent ID: #29999 Points: 7
Reviewer: dgoulet Sponsor: Sponsor27-must

Description

Currently, an onion service's intro points each expire (intentionally rotate) after receiving rand(16384, 16384*2) intro requests.

Imagine an attacker who generates many introduction attempts. Since each intro attempt can take its own path to the target intro point, the bottleneck will be the introduction circuit itself. Let's say that intro circuit can sustain 500KBytes/s of traffic. That's 1000 intro requests per second coming in -- so after 24ish seconds (rand(16,32)), that intro point will expire: the onion service will pick a new one and start publishing new onion descriptors.

If the intro circuit can handle 1MByte/s, then rotation will happen after 12ish seconds.
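
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch (assuming roughly 512 bytes per introduction request, which is an illustrative figure rather than an exact protocol constant):

{{{
/* Rough sketch: how long until an intro point rotates if an attacker
 * saturates the introduction circuit. Assumes ~512 bytes per intro
 * request; that figure is illustrative, not a protocol constant. */
#include <stdio.h>

#define INTRO_BYTES_PER_REQUEST 512.0
#define ROTATION_MIN 16384.0         /* rand(16384, 16384*2) lower bound */
#define ROTATION_MAX (16384.0 * 2)   /* rand(16384, 16384*2) upper bound */

static void time_to_rotation(double circuit_bytes_per_sec)
{
  double requests_per_sec = circuit_bytes_per_sec / INTRO_BYTES_PER_REQUEST;
  printf("%8.0f bytes/s -> %5.0f intro requests/s -> rotation after "
         "%.0f-%.0f seconds\n",
         circuit_bytes_per_sec, requests_per_sec,
         ROTATION_MIN / requests_per_sec, ROTATION_MAX / requests_per_sec);
}

int main(void)
{
  time_to_rotation(500 * 1024.0);   /* ~500 KBytes/s, as above */
  time_to_rotation(1024 * 1024.0);  /* ~1 MByte/s */
  return 0;
}
}}}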

With three intro circuits, each receiving intro requests at a different rate, we could end up changing our descriptor even more often than this. There are at least four impacts from this attack:

(1) Onion services spend energy and bandwidth generating new intro circuits, and publishing new descriptors to list them.

(2) Clients might get the last onion descriptor, not the next one, and so they'll attempt to introduce to a circuit that's no longer listening.

(3) The intro points themselves get a surprise burst of 16k-32k incoming circuits, probably plus a lot more after that because the attacker wouldn't know when to stop. Not only that, but for v2 onion services these circuits use the slower TAP as the circuit handshake at the intro point.

(4) The HSDirs get a new descriptor every few seconds, which aside from the bandwidth and circuit load, tells them that the onion service is under attack like this.

Intro points that can handle several megabytes of traffic per second will keep up and push the intro requests back to the onion service, thus hastening the rotation. Intro points that *can't* handle that traffic will become congested and no fun to use for others during the period of the attack.

The reason we rotate after 16k-32k requests is because the intro point keeps a replay cache, to avoid ever responding to a given intro request more than once.

One direction would be to work on bumping up the size of the replay cache, or designing a different data structure like a bloom filter so we can scale the replay cache better. I think we could succeed there. The benefits would be to (1) and (2) and (4) above, i.e. onion services won't spend so much time making new descriptors, and clients will be more likely to use an onion descriptor that's still accurate. The drawback would be to (3), where the hotspots last longer, that is, the poor intro point feels the damage for a longer period of time.
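
For a sense of what the bloom filter direction could look like (this is only an illustrative sketch, not how Tor's replaycache is implemented), a fixed-size filter over request digests keeps memory constant no matter how many requests arrive; the cost is a tunable false-positive rate, i.e. occasionally rejecting a legitimate request that was never actually seen:

{{{
/* Hypothetical sketch of a bloom-filter replay cache for intro requests.
 * Memory stays fixed regardless of request volume; the trade-off is a
 * small false-positive rate (legit requests occasionally rejected). */
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS  (1 << 20)   /* 1 Mbit, i.e. 128 KB of memory */
#define BLOOM_HASHES 4

static uint8_t bloom[BLOOM_BITS / 8];

/* Derive one bit position from a 32-byte request digest (e.g. a hash of
 * the INTRODUCE payload) by slicing it, assuming the digest is uniform. */
static uint32_t bloom_position(const uint8_t *digest, int i)
{
  uint32_t pos;
  memcpy(&pos, digest + 4 * i, sizeof(pos));
  return pos % BLOOM_BITS;
}

/* Return 1 if this digest was (probably) seen before; record it either way. */
static int bloom_seen_and_add(const uint8_t digest[32])
{
  int seen = 1;
  for (int i = 0; i < BLOOM_HASHES; i++) {
    uint32_t pos = bloom_position(digest, i);
    if (!(bloom[pos / 8] & (1 << (pos % 8))))
      seen = 0;
    bloom[pos / 8] |= (uint8_t)(1 << (pos % 8));
  }
  return seen;
}
}}}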

Taking a step back, I think there are two directions we can go here. Option one, we can try to scale to handle the load. We would focus on load balancing better, like reacting to an attack by choosing super fast intro points, and either choosing fast middle nodes too, or some fancier approach like having multiple circuits to your intro point. Option two, we recognize that this volume of introduction requests represents a problem in itself, and try to introduce defenses at the intro point level. Here we focus on proof of work schemes or other ways to slow down the flow, or we improve the protocol to pass along hints about how to sort the intro requests by priority.

Child Tickets

Change History (27)

comment:1 Changed 15 months ago by asn

Cc: asn added

Agreed this is an interesting and useful problem to work on.

comment:2 Changed 14 months ago by dgoulet

Keywords: tor-hs tor-dos added
Milestone: Tor: unspecified

comment:3 Changed 5 months ago by pili

Sponsor: Sponsor27

comment:4 Changed 5 months ago by asn

Sponsor: Sponsor27 → Sponsor27-must

comment:5 Changed 5 months ago by asn

Points: 7

There is some easy stuff we can do here. Assigning 7 points to do the easy stuff and think about future work, in case we don't get to fix this completely.

comment:6 Changed 5 months ago by pili

Parent ID: #29999

comment:7 Changed 4 months ago by gaba

Keywords: network-team-roadmap-2019-Q1Q2 added

Add keyword to tickets in network team's roadmap.

comment:8 Changed 4 months ago by asn

Back when that ticket was filed, I also had the chance to meet with some onion service experts and independently discuss this issue. Here are some unpublished notes:

We decided that allowing this attack because of the replay cache is a red herring. Specifically, the replay cache is not that big with only 16k-32k requests so we could indeed grow it. Furthermore, we could also clear the cache after X requests and start with a new one; that would allow the attacker to replay each introduction once, but that's fine because making new intro requests is not *that heavy* anyway, and it's definitely better than allowing them to rotate our intro points non-stop.

Also it's important to realize that the replay cache is held on the HS-side and not on the intropoint-side. I just verified this in our codebase, because I was also confused about this! The HS keeps two (!) replay caches for each INTRODUCE2 cell: one is per-intropoint (v3: replay_cache / v2: accepted_intro_rsa_parts) and the other is per-HS and (v3: replay_cache_rend_cookie / v2: accepted_intro_dh_parts).
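
As a rough illustration of that structure (the names, types and signatures below are made up for this sketch, not Tor's actual API), the service-side check amounts to a two-stage test:

{{{
/* Hypothetical sketch of the two-stage INTRODUCE2 replay check described
 * above; names, types and signatures are illustrative, not Tor's API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct replay_cache;  /* opaque set of previously seen values */

/* Assumed helper: returns true if the value was already present,
 * and inserts it otherwise. */
bool replay_cache_check_and_add(struct replay_cache *cache,
                                const uint8_t *value, size_t len);

struct intro_point {
  struct replay_cache *replay_cache;              /* per-intro-point */
};

struct onion_service {
  struct replay_cache *replay_cache_rend_cookie;  /* per-service */
};

/* Drop an INTRODUCE2 cell if either cache has seen it before. */
static bool
intro2_is_replay(struct onion_service *service, struct intro_point *ip,
                 const uint8_t *cell_digest, const uint8_t *rend_cookie)
{
  if (replay_cache_check_and_add(ip->replay_cache, cell_digest, 32))
    return true;   /* same intro material already seen at this intro point */
  if (replay_cache_check_and_add(service->replay_cache_rend_cookie,
                                 rend_cookie, 20))
    return true;   /* same rendezvous cookie already seen on this service */
  return false;
}
}}}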

I think what we should do here is:

a) Short-term: Reevaluate our replay detection strategy and see whether it's indeed too heavy. Evaluate whether we need both caches. Evaluate the size of our replay caches given X requests. Evaluate whether we can clear our replay caches after Y requests and just keep on using the same intro points.

c) Medium-term: Consider more high-level directions for handling heavy load, like proof of work, path selection, or other changes to the intro protocol.

comment:9 Changed 4 months ago by asn

Owner: set to asn
Status: new → assigned

comment:10 Changed 3 months ago by arma

I like 'a' as a short term plan.

comment:11 Changed 3 months ago by s7r

I like 'a' as a short term plan as well. Proof-of-work solutions are non-trivial engineering challenges, consume time, and it eventually still comes down to the simple question of how many resources (work/time/bandwidth) the attacker is willing to spend to pull this off.

What if we add a time-based lifetime for each intro point, which would be a random value chosen at intro point selection between n and m hours, along with an ALLOW_RESET_CACHE parameter, which would be a random number between o and p. We also keep the intro requests lifetime rand(16384, 16384*2), combine it with ALLOW_RESET_CACHE, and rebuild the descriptor when the first of these two limits is reached. This way we don't have to increase the cache but only reset it.

For example:
An onion service selects Z as intro point. It also chooses these random values and remembers them for this intro point:

  • time based lifetime = 5 hours (let's pretend n = 1; m = 6)
  • ALLOW_RESET_CACHE = 1400 (let's pretend ALLOW_RESET_CACHE = rand(100, 7000))
  • intro requests lifetime = 20122 (rand(16384, 16384*2))

Now, this intro point will be rotated either after 5 hours, if the onion service is not under attack, or after 20122 * 1400 = 28,170,800 intro requests.

If high values were chosen for ALLOW_RESET_CACHE and the intro requests lifetime, we would indeed get many introduction requests through the same introduction point, but we still have the time-based lifetime parameter as a safety precaution that will eventually move us off this introduction point.
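
Roughly, the decision logic I have in mind looks like the sketch below (struct layout, names and ranges are placeholders, just to illustrate the rule):

{{{
/* Illustrative sketch of the proposed rotation rule: rotate when either
 * the time-based lifetime expires or the replay cache has been reset
 * ALLOW_RESET_CACHE times. All names and ranges are placeholders. */
#include <stdbool.h>
#include <time.h>

struct intro_point_state {
  time_t expiry_time;           /* now + rand(n, m) hours, set at selection */
  unsigned max_cache_resets;    /* the ALLOW_RESET_CACHE value, rand(o, p) */
  unsigned cache_resets_so_far;
  unsigned requests_per_reset;  /* rand(16384, 16384*2) */
  unsigned requests_since_reset;
};

/* Called for each introduction request; returns true when the service
 * should rotate away from this intro point and republish its descriptor. */
static bool
handle_intro_request(struct intro_point_state *ip, time_t now)
{
  if (now >= ip->expiry_time)
    return true;                        /* time-based lifetime reached */

  if (++ip->requests_since_reset >= ip->requests_per_reset) {
    ip->requests_since_reset = 0;
    /* Clear the replay cache here instead of rotating the intro point. */
    if (++ip->cache_resets_so_far >= ip->max_cache_resets)
      return true;                      /* too many resets: rotate anyway */
  }
  return false;                         /* keep using this intro point */
}
}}}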

We can go even more crazy about this and use the introduction point's measured bandwidth or consensus weight, so we choose parameters based on how much the intro point is actually able to support in terms of bandwidth, and don't end up maintaining an introduction point that is hammered and can't process the requests because it's too slow. Or find another way to check if the intro point is actually responding to intro requests. But even without these smarter computations, the presented solution should still be better than what we have now.

All 3 parameters must be randomized as described, otherwise we open the door to easier analysis and predictability for attackers, like estimating with high probability when the intro point change will occur, etc. (outside the scope of this ticket).

The numbers for the time-based lifetime and ALLOW_RESET_CACHE don't have any analysis behind them; they are just off the top of my head, only to illustrate an example of the logic we need to code. We need to evaluate and choose good values for these parameters, if we think this is a good idea.

comment:12 Changed 2 months ago by cypherpunks

My concern about a proof of work approach is it appears to open a back channel where a hidden service operator has influence over client behaviour. This could result in clients executing possibly rarely used/exploitable codepaths, or new correlation attacks. For example, the hidden service operator sets a requirement for a PoW that takes 1.21 KW to compute. The operator has also hacked in to an energy company with high resolution "smart" meters, then could sit back and watch as users login to the service.

comment:13 in reply to:  12 ; Changed 8 weeks ago by cypherbits

Replying to cypherpunks:

My concern about a proof of work approach is it appears to open a back channel where a hidden service operator has influence over client behaviour. This could result in clients executing possibly rarely used/exploitable codepaths, or new correlation attacks. For example, the hidden service operator sets a requirement for a PoW that takes 1.21 KW to compute. The operator has also hacked in to an energy company with high resolution "smart" meters, then could sit back and watch as users login to the service.

PoW should be a fixed value in the network consensus or hardcoded; if we want the HS to be capable of configuring it, then we should limit the parameters. That's it.


On the other hand I have two questions on the implementation and replay caches:

-How does the replay cache work for INTRODUCE1 cells? The bug allowing the same circuit to send many INTRODUCE1 cells should have been closed years ago.

-Why do we actually rotate Introduction Points? And why do we do it after x INTRODUCE cells and not based on time, like every 24 hours?

comment:14 in reply to:  13 Changed 7 weeks ago by asn

Replying to cypherbits:

On the other hand I have two questions on the implementation and replay caches:

-How does the replay cache work for INTRODUCE1 cells? The bug allowing the same circuit to send many INTRODUCE1 cells should have been closed years ago.

-Why do we actually rotate Introduction Points? And why do we do it after x INTRODUCE cells and not based on time, like every 24 hours?

Hello, this is not a discussion forum. Please use the mailing list for such discussions. Please see comment:8 for more info on the replay cache.

And yes, the plan with this ticket is to only rotate intro points based on time, and not based on number of introductions (see comment:8 again).

comment:15 Changed 7 weeks ago by asn

Actual Points: 4

comment:16 Changed 7 weeks ago by asn

Actual Points: 4 → 6
Status: assigned → needs_review

OK here we go: https://github.com/torproject/tor/pull/1163

The functionality was not so hard to do, but the tests were a real PITA to write since I needed to create a parseable INTRO2 cell (they actually look quite simple in the final branch but that took tons of experimentation and mocking to do).

WRT v3 code quality, I created a new periodic function called maintain_intro_point_replay_caches() which maintains the replay cache. An alternative (perhaps cleaner but definitely harder) approach would be to make this "max number of entries" a parameter of the replaycache and do the purging when we add elements, as part of the replaycache subsystem. I tried to do this, but the replaycache code is kinda messy and I opted for the easier approach.
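
For illustration, the periodic maintenance amounts to something like the sketch below (the cap value, helper names and signature are placeholders here, not necessarily what the branch does):

{{{
/* Rough sketch of the periodic replay-cache maintenance; the cap and the
 * helper names/signatures are placeholders, not the branch's actual code. */
#define MAX_REPLAY_CACHE_ENTRIES 16384  /* hypothetical cap */

struct replay_cache;
struct intro_point;

/* Assumed helpers, for illustration only. */
unsigned replay_cache_num_entries(const struct replay_cache *cache);
void replay_cache_clear(struct replay_cache *cache);
struct replay_cache *intro_point_get_replay_cache(struct intro_point *ip);

/* Runs periodically: if an intro point's replay cache grows past the cap,
 * wipe it and keep using the same intro point instead of rotating. The
 * cost is that an attacker can replay each old request once more. */
static void
maintain_intro_point_replay_caches(struct intro_point **intro_points,
                                   unsigned n_intro_points)
{
  for (unsigned i = 0; i < n_intro_points; i++) {
    struct replay_cache *cache = intro_point_get_replay_cache(intro_points[i]);
    if (replay_cache_num_entries(cache) > MAX_REPLAY_CACHE_ENTRIES)
      replay_cache_clear(cache);
  }
}
}}}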

Also, I made good unittests for v3, but I never attempted to do the same for v2. It just seems like too much work, given how much work the v3 test was.

Finally, I have not tested this on chutney or the real network. This is something I need to do before putting it in merge_ready.

comment:17 Changed 7 weeks ago by dgoulet

Reviewer: dgoulet

comment:18 Changed 7 weeks ago by dgoulet

Status: needs_review → needs_revision

Two tiny comments. Else, this is solid! Put it in merge_ready once teor's and my comments have been addressed.

LGTM!

I'm currently running this on our test bed. We'll let you know if anything comes up but so far so good for upstream merge!

comment:19 Changed 4 weeks ago by gaba

Keywords: network-team-roadmap-august added; network-team-roadmap-2019-Q1Q2 removed

comment:20 in reply to:  18 Changed 13 days ago by asn

Status: needs_revision → needs_review

Replying to dgoulet:

Two tiny comments. Else, this is solid! Put it in merge_ready once teor's and my comments have been addressed.

LGTM!

I'm currently running this on our test bed. We'll let you know if anything comes up but so far so good for upstream merge!

OK, I pushed a branch with fixups and rebased to latest master (because of the practracker changes): https://github.com/torproject/tor/pull/1199

Apart from the PR comments, it fixes a bug on the v2 side.

comment:21 Changed 13 days ago by dgoulet

Status: needs_review → merge_ready

Travis had a weird failure about "not having Internet for apt install" ... but otherwise this is ready to go.

comment:22 Changed 13 days ago by nickm

Milestone: Tor: unspecified → Tor: 0.4.2.x-final

Assuming this is for 0.4.2, and not backport?

comment:23 in reply to:  22 Changed 13 days ago by asn

Replying to nickm:

Assuming this is for 0.4.2, and not backport?

Agreed.

comment:24 Changed 13 days ago by nickm

Status: merge_ready → needs_revision

It looks like there are assertion failures reported in the chutney test here on travis.

comment:25 in reply to:  24 Changed 12 days ago by asn

Status: needs_revision → merge_ready

Replying to nickm:

It looks like there are assertion failures reported in the chutney test here on travis.

Oops. I managed to reproduce this with chutney and fixed it. I pushed a fixup. The cause was that v2 initializes the replay cache upon receiving an INTRO2 cell, so it can be uninitialized in the beginning; I added a test against that.

I also pushed a squashed branch because there were some conflicts with autosquash:
https://github.com/torproject/tor/pull/1207

Marking as merge_ready assuming that all CI passes.

comment:26 Changed 10 days ago by nickm

I think this code looks okay but before we merge it, I think we should have a patch for tor-spec that explains the new behavior of the replay cache. We should also have a quick proposal that explains why it's safe to allow replays, since I've usually thought of them as a way to mount active traffic analysis attacks.

comment:27 in reply to:  26 Changed 6 days ago by asn

Replying to nickm:

I think this code looks okay but before we merge it, I think we should have a patch for tor-spec that explains the new behavior of the replay cache. We should also have a quick proposal that explains why it's safe to allow replays, since I've usually thought of them as a way to mount active traffic analysis attacks.

Here is a torspec patch: https://github.com/asn-d6/torspec/commit/f0fbcf3d606b8fb8ec49b1ba8f790607725dbd8b
https://github.com/asn-d6/torspec/tree/bug26294

We actually had not heard that replay caches are there to protect against traffic analysis attacks. How does the attack work? I considered that identical INTRO2 cells could be used as a signal to the HS guard, but since they are end-to-end encrypted the signal should not be visible, right?
