attacker can force intro point rotation by ddos

changed milestone to %Tor: unspecified

added 043-deferred 044-deferred actualpoints::6 component::core tor/tor milestone::Tor: unspecified network-team-roadmap-2020Q1 owner::asn points::7 priority::medium reviewer::dgoulet severity::normal sponsor::27-can status::needs-revision tor-dos tor-hs type::defect labels

Agreed this is an interesting and useful problem to work on.

Trac:
Cc: N/A to asn

Trac:
Keywords: N/A deleted, tor-hs, tor-dos added
Milestone: N/A to Tor: unspecified

Trac:
Sponsor: N/A to Sponsor27

Trac:
Sponsor: Sponsor27 to Sponsor27-must

There are some easy stuff we can do here. Assigning 7 points to do the easy stuff and think about future stuff, if we don't get to fix this completely.

Trac:
Points: N/A to 7

Trac:
Parent: N/A to #29999 (moved)

Add keyword to tickets in network team's roadmap.

Trac:
Keywords: N/A deleted, network-team-roadmap-2019-Q1Q2 added

Back when that ticket was filed, I also had the chance to meet with some onion service experts and independently discuss this issue. Here are some unpublished notes:

We decided that allowing this attack because of the replay cache is a red herring. Specifically, the replay cache is not that big with only 16k-32k requests so we could indeed grow it. Furthermore, we could also clear the cache after X requests and start with a new one; that would allow the attacker to replay each introduction once, but that's fine because making new intro requests is not that heavy anyway, and it's definitely better than allowing them to rotate our intro points non-stop.

Also it's important to realize that the replay cache is held on the HS-side and not on the intropoint-side. I just verified this in our codebase, because I was also confused about this! The HS keeps two (!) replay caches for each INTRODUCE2 cell: one is per-intropoint (v3: replay_cache / v2: accepted_intro_rsa_parts) and the other is per-HS and (v3: replay_cache_rend_cookie / v2: accepted_intro_dh_parts).

I think what we should do here is:

a) Short-term: Reevaluate our replay detection strategy and see whether it's indeed too heavy. Evaluate whether we need both caches. Evalute the size of our replay caches given X requests. Evaluate whether we can clear our replay caches after Y requests and just keep on using the same intro points.

c) Medium-term: Consider more high-level directions to handle big load, like proof of work, path selection, or other intro protocol.s

Trac:
Owner: N/A to asn
Status: new to assigned

I like 'a' as a short term plan.

I like 'a' as a short term plan as well. Proof of work solutions are non trivial engineering challenges, consume time and it eventually still gets down to the simple question how much resources/work/time/bandwidth is the attacker willing to give to pull this of.

what if we add a time based lifetime for each intro point, which will be a random value chosen at intro point selection between n and m hours, along with a ALLOW_RESET_CACHE parameter which will be a random number between o and p and also maintain the intro requests lifetime rand(16384, 16384*2) which will be combined with ALLOW_RESET_CACHE, and rebuild descriptor when the first from these two is reached. This way we don't have to increase the cache but only reset it.

For example: An onion service selects Z as intro point. It also chooses these random values and remembers them for this intro point:

time based lifetime = 5 hours (let's pretend n = 1; m = 6)
ALLOW_RESET_CACHE = 1400 (let's pretend ALLOW_RESET_CACHE = rand(100, 7000))
intro requests lifetime = 20122 (rand(16384, 16384*2)

Now, this intro point will be rotated either after 5 hours, if the onion service is not under attack, either after 20122 * 1400 = 28,170,800 intro requests.

If high values would have been chose for ALLOW_RESET_CACHE and intro requests lifetime, indeed we will be getting many introduction requests through the same introduction point, but we still have the time based lifetime parameter as a safety precaution that will eventually move us from this introduction point.

We can go even go more crazy about this and use the introduction point measured bandwidth or consensus weight so we choose parameters based on how much the intro point is actually able to support in terms of bandwidth, so we don't end up with maintaining an introduction point that is hammered and can't process the requests because it's too slow. Or find another way to check if the intro point is actually responding to intro requests. But even without these smarter computations the presented solution still has to be better than what we have now.

All 3 parameters must be randomized as described, otherwise we open the door for easier analysis and predictability for attackers, like estimate with high probability when will the intro point change occur, etc. (outside the scope of this ticket).

The numbers for time based lifetime and ALLOW_RESET_CACHE don't have any analysis behind, they are just from top of my head and only to illustrate and example about the logic we need to code. We need to evaluate and choose good parameters for these values, if we think this is a good idea.

My concern about a proof of work approach is it appears to open a back channel where a hidden service operator has influence over client behaviour. This could result in clients executing possibly rarely used/exploitable codepaths, or new correlation attacks. For example, the hidden service operator sets a requirement for a PoW that takes 1.21 KW to compute. The operator has also hacked in to an energy company with high resolution "smart" meters, then could sit back and watch as users login to the service.

Replying to cypherpunks:

My concern about a proof of work approach is it appears to open a back channel where a hidden service operator has influence over client behaviour. This could result in clients executing possibly rarely used/exploitable codepaths, or new correlation attacks. For example, the hidden service operator sets a requirement for a PoW that takes 1.21 KW to compute. The operator has also hacked in to an energy company with high resolution "smart" meters, then could sit back and watch as users login to the service.

PoW should be a fixed value on the network consensus or hardcoded, if we want the HS to be capable of configuring it then we should limit the parameters. Thats it.

On the other hand I have two questions on the implementation and replay caches:

-How does the replay cache works for INTRODUCE1 cells? The bug allowing for the same circuit to send many INTRODUCE1 should be closed years ago.

-Why we actually rotate Introduction Points? and why we do it after x INTRODUCE cells and not based on a time, like each 24 hours?

Trac:
Username: cypherbits

Replying to cypherbits:

On the other hand I have two questions on the implementation and replay caches:

-How does the replay cache works for INTRODUCE1 cells? The bug allowing for the same circuit to send many INTRODUCE1 should be closed years ago.

-Why we actually rotate Introduction Points? and why we do it after x INTRODUCE cells and not based on a time, like each 24 hours?

Hello, this is not a discussion forum. Please use the mailing list for such discussions. Please see comment:8 for more info on the replay cache.

And yes, the plan with this ticket is to only rorate intro points based on time, and not based on number of introductions (see comment:8 again).

Trac:
Actualpoints: N/A to 4

OK here we go: https://github.com/torproject/tor/pull/1163

The functionality was not so hard to do, but the tests were a real PITA to write since I needed to create a parseable INTRO2 cell (they actually look quite simple in the final branch but that took tons of experimentation and mocking to do).

WRT v3 code quality, I created a new periodic function called maintain_intro_point_replay_caches() which maintains the replay cache. An alternative (perhaps cleaner but definitely harder) approach would be to make this "max number of entries" to be a parameter of the replaycache and do the purging when we add elements as part of the replaycache subsystem. I tried to do this, but the replaycache code is kinda messy and I opted for the easier approach.

Also, I made good unittests for v3, but I never attempted to do the same for v2. It just seems like too much work, given how much work the v3 test was.

Finally, I have not tested this on chutney or the real network. This is something I need to do before putting it in merge_ready.

Trac:
Status: assigned to needs_review
Actualpoints: 4 to 6

Trac:
Reviewer: N/A to dgoulet

Two tiny comments. Else, this is solid! Put it in merge_ready once teor's and my comments have been addressed.

LGTM!

I'm currently running this on our test bed. We'll let you know if anything comes up but so far so good for upstream merge!

Trac:
Status: needs_review to needs_revision

attacker can force intro point rotation by ddos

Child items 0

Activity