Opened 10 months ago

Closed 10 months ago

Last modified 10 months ago

#9072 closed defect (fixed)

#9063 enables Guard discovery in about an hour by websites

Reported by: arma Owned by:
Priority: major Milestone: Tor: 0.2.3.x-final
Component: Tor Version: Tor: 0.2.4.13-alpha
Keywords: Cc: robgjansen, Flo, nickm, rpw
Actual Points: Parent ID:
Points:

Description

There are classes of relay cells that don't count in the circuit windows. Specifically, only data relay cells count in the circuit windows.

So for example, a malicious website could arrange for the client to open 2200 streams to it, and then the website could close them all at once. 2200 relay end cells would work their way back towards the client. The #9063 fix would get upset and kill the circuit.

Child Tickets

Change History (33)

comment:1 Changed 10 months ago by arma

Two more variations:

A) A website just causes Alice to launch 2200 stream connections. Then she gets 2200 connected cells heading her way.

B) If Alice is using the leaky pipe design and has two circ-window-maxes heading her way, but she's also uploading stuff and triggers a sendme cell to return to her, that brings us to 2001.

comment:2 Changed 10 months ago by mikeperry

  • Cc rpw added
  • Priority changed from normal to major

The attack enabled by #9063 is extremely similar to the Guard discovery attack from rpw's paper.

It would appear to allow a malicious website to find your guards in about an hour.

Rock, meet the hard place :/.

comment:3 Changed 10 months ago by mikeperry

  • Summary changed from #9063 went too far to #9063 enables Guard discovery of 0.2.4.x clients in about an hour by websites

Attack is courtesy of our friend boboper.

Thanks bob!

comment:4 Changed 10 months ago by mikeperry

See also #9001 for potential longer-term solutions.

comment:5 Changed 10 months ago by mikeperry

  • Priority changed from major to critical
  • Summary changed from #9063 enables Guard discovery of 0.2.4.x clients in about an hour by websites to #9063 enables Guard discovery in about an hour by websites

Actually, this is *all* clients. Bob pointed out that it is a client's Guards that have to upgrade for the client to be vulnerable, not the client itself.

Pretty bad stuff.

Perhaps we should remove #9063 and save it for Guards who actually experience OOMing?

comment:6 Changed 10 months ago by andrea

Geez, fuck -- yeah, probably roll it back unless/until we can find a smarter way to implement it.

comment:7 Changed 10 months ago by andrea

  • Status changed from new to needs_review

My bug9072-023, bug9072-024 and bug9072-025 #if out the #9063 code.

comment:8 Changed 10 months ago by nickm

Hm. Targeted OOM, or guard discovery. Not a great choice.

We might need to do the 0.2.5 version of a #9063 fix that we were talking about, where we define a maximum space allocatable for cells, and do some kind of OOM killer on clogged-looking circuits if we run out of space for new queued cells. Not great, and not delightful to contemplate for an 0.2.4 backport, but the OOM situation isn't great either.

comment:9 follow-up: Changed 10 months ago by nickm

andrea: feel free to merge those for now.

comment:10 Changed 10 months ago by rransom

This doesn't make discovering a client's entry guards significantly easier. It does make failing a client's circuits from the exit/destination end significantly easier, and as soon as the path-bias detector is turned on, that becomes a full remote user-tracing attack.

comment:11 follow-up: Changed 10 months ago by nickm_mobile

Here are a few sketch oom-killer algorithms. Let's try to find flaws in them and find something better.

Algorithm 1.
1) have a max number of allocated cells configured.
2) when that number is reached, consider all circuits and kill those with the longest queues until we are at X% of the limit.

Algorithm 2.
1) as in alg 1 above.
2) when we hit the limit, consider the connections with the largest number of queued cells on circuits aiming for them.
3) kill those connections until we are at X% of the limit.

Algorithm 3.
1, 2) as in alg 2 above.
3) kill the biggest circuits on those connections until we are at X% of our limit.

Algorithms 1T, 2T, 3T:
As algorithms 1, 2, 3, but instead of longest queue, look at longest flush time, where flush time is the queue length divided by the connection's observed bandwidth.

IMO none of 1, 2, 3 is too complex for 0.2.4.
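Algorithm 1 can be sketched in a few lines (a rough illustration only; the function name, the recovery fraction, and the dict representation are my inventions, not Tor's actual data structures):

```python
def oom_kill_circuits(circuits, max_cells, recover_frac=0.75):
    """Algorithm 1 sketch: when the total number of queued cells exceeds
    max_cells, kill the circuits with the longest queues until usage is
    back down to recover_frac * max_cells (the "X%" in the comment).
    `circuits` maps circuit id -> queue length; returns the killed ids."""
    total = sum(circuits.values())
    killed = set()
    if total <= max_cells:
        return killed  # not at the limit; do nothing
    target = recover_frac * max_cells
    # Consider circuits longest-queue-first.
    for circ_id, qlen in sorted(circuits.items(), key=lambda kv: -kv[1]):
        if total <= target:
            break
        killed.add(circ_id)
        total -= qlen
    return killed
```

The key property (elaborated in later comments) is that nothing is killed until the global limit is actually hit, so merely inflating one circuit's queue is not enough to get an honest circuit killed.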

comment:12 follow-up: Changed 10 months ago by nickm_mobile

Algorithms 1U, 2U, 3U:

As algorithms 1T, 2T, and 3T, but define a queue's flush time as some function that accounts for the number of other circuits targeting the same connection. If we were using round robin, I would say: N/BW, where BW is the connection's observed bandwidth and N is the sum of all queue lengths on that connection, each queue length clipped to be no longer than the queue under consideration.
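A sketch of this round-robin-aware flush-time estimate (names are hypothetical; a real relay would plug in its observed connection bandwidth):

```python
def flush_time_u(queues, conn_bw, i):
    """Estimated time to drain queue i on a connection whose circuits
    are served round-robin at conn_bw cells/sec: N / BW, where N sums
    every queue length on the connection, each clipped to be no longer
    than queue i (longer queues only compete until i has drained)."""
    q_i = queues[i]
    n = sum(min(q, q_i) for q in queues)
    return n / conn_bw
```

For example, with queues of [100, 10, 50] cells on a 10-cell/sec connection, the 10-cell queue drains in (10+10+10)/10 = 3 seconds, while the 100-cell queue needs (100+10+50)/10 = 16 seconds.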

comment:13 Changed 10 months ago by rransom

Instead of closing circuits or connections in response to OOM, you could check whether the relay has hit its rate limit, and if it hasn't, send a bunch of cells along the circuit/connection. (Perhaps the next relay will have more room for them.)

comment:14 follow-up: Changed 10 months ago by rransom

How long are cell queues for normal clients' circuits and connections in practice?

comment:15 Changed 10 months ago by arma

"Maximum number of cells in entry's circuit queue = 8 * (1000_data + 10_sendme_circ + 20_sendme_stream + N * (1_connected + 1_end)), N is number of maximum allowed streams for client"

"so 10k is actual number for client with 100 streams limit"

(quoting from irc)

If we disable the leaky pipe feature, 2k cells is enough, with 100 stream limit.

Of course, this isn't counting drop cells, which we might want to add in at some point to help with website fingerprinting.
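As a sanity check on the quoted formula (a sketch of the arithmetic only; the multiplier of 8 is the leaky-pipe factor from the IRC quote, and N is the per-client stream limit):

```python
def max_entry_queue_cells(n_streams, leaky_pipe_factor=8):
    """Upper bound on an entry guard's circuit queue per the formula
    quoted above: factor * (1000 data cells + 10 circuit SENDMEs
    + 20 stream SENDMEs + n_streams * (1 CONNECTED + 1 END))."""
    return leaky_pipe_factor * (1000 + 10 + 20 + n_streams * (1 + 1))
```

With N = 100 this gives 9840, matching the quoted "10k is actual number"; dropping the leaky-pipe factor gives 1230, consistent with the ~2k estimate above.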

comment:16 in reply to: ↑ 12 ; follow-up: Changed 10 months ago by robgjansen

Replying to nickm_mobile:

Algorithms 1U, 2U, 3U:

As algorithms 1T, 2T, and 3T, but define a queue's flush time as some function that accounts for the number of other circuits targeting the same connection. If we were using round robin, I would say: N/BW, where BW is the connection's observed bandwidth and N is the sum of all queue lengths on that connection, each queue length clipped to be no longer than the queue under consideration.

Is there any reason not to simplify this down to the circuit level only? Anything based on connections is easily subverted by a sybil attack. And you can't consider the time-independent queue length alone because of the sybil attack.

The goal is to drive up the cost of the attack while making sure you are not killing an honest client's circuit. If you consider a function of the queue length and the waiting time of the longest waiting cell for each circuit, then a malicious client will have to read cells at a rate at least as fast as the slowest honest client *on every sybil* in order to cause the relay to target the honest client's circuit. It can be shown that the longer the attacker's queue becomes, the faster it must read from it to avoid selection because it needs to keep the waiting times of the cells in the malicious circuits lower than those of all honest circuits.

comment:17 Changed 10 months ago by robgjansen

I guess authenticated SENDMEs, where the exit inserts a 1-byte nonce into every 100th cell and won't increase the package window unless the SENDME contains the nonce, is also off the table since, like the queue-length-limit defense, it doesn't handle non-data cells.

comment:18 Changed 10 months ago by robgjansen

I imagine it's not a simple fix, but what are the other problems with making all classes of relay cells count against the circuit package window?

comment:19 Changed 10 months ago by nickm

rransom said:

Instead of closing circuits or connections in response to OOM, you could check whether the relay has hit its rate limit, and if it hasn't, send a bunch of cells along the circuit/connection. (Perhaps the next relay will have more room for them.)

That sounds plausible as a reliability improvement, but unless I misunderstand, it won't actually prevent any DoS here, since the recipient could just stop reading and thereby make the sender's TCP implementation refuse to write any more.

How long are cell queues for normal clients' circuits and connections in practice?

Less than 1000; I don't know how much less.

comment:20 Changed 10 months ago by nickm

arma wrote:

If we disable the leaky pipe feature, 2k cells is enough, with 100 stream limit.

Is there any such 100 stream limit? I'm not seeing it in the code today. So if we want to protect existing clients from this issue, 2k is too low. We'd need to write a patch to limit that, which would create another way for a hostile website to make a client open a ton of circuits. (We could mitigate that a little by applying a limit only to the number of circuits for which we've sent a BEGIN but not gotten a CONNECTED, I guess, and delaying pending streams.)

So if I'm not wrong about the code as it stands, that implies that N=65535, so the magic number is a hefty 1056816 (or merely 264204 if you don't believe in leaky-pipe). Not so good.

Or am I missing a real 100-stream limit somewhere?
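The figures above can be reproduced from the comment-15 formula under one assumption of mine about the constants plugged in: N = 65536 (the full stream-ID space), with a multiplier of 8 for leaky-pipe and 2 otherwise:

```python
# Hypothetical reconstruction of the numbers in this comment, using the
# per-circuit formula from comment 15 with N = 65536 stream slots:
per_circuit = 1000 + 10 + 20 + 65536 * (1 + 1)   # = 132102 cells
leaky_pipe = 8 * per_circuit      # multiplier 8 with leaky-pipe
no_leaky_pipe = 2 * per_circuit   # multiplier 2 without
```

This yields 1056816 and 264204 respectively, matching the comment.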

comment:21 Changed 10 months ago by nickm

robgjansen wrote:

I guess authenticated SENDMEs, where the exit inserts a 1-byte nonce into every 100th cell and won't increase the package window unless the SENDME contains the nonce, is also off the table since, like the queue-length-limit defense, it doesn't handle non-data cells.
I imagine it's not a simple fix, but what are the other problems with making all classes of relay cells count against the circuit package window?

The biggest problems I know with each of these are the migration issues. We can't require that clients and exits use them until all clients and exits support them. That doesn't make them complete non-starters, but it does keep us from deploying them any time, say, in the next two weeks. :)

comment:22 in reply to: ↑ 16 Changed 10 months ago by nickm

Replying to robgjansen:

Is there any reason not to simplify this down to the circuit level only? Anything based on connections is easily subverted by a sybil attack. And you can't consider the time-independent queue length alone because of the sybil attack.

What you say seems plausible. Though it might not be too bad to do a time-independent queue length as "good enough for now" in 0.2.4 and maybe 0.2.3, then work on the better thing in 0.2.5, targeting a possible backport to 0.2.4 if the better thing turns out to be not too hard. After all, if the #9063 solution seemed "good enough as a temporary measure" (barring this issue) then surely a queue-length-based solution is not worse.

I think this here (algorithm 1 in 0.2.3 and 0.2.4; work on better stuff for 0.2.5 for a possible backport) is my current preferred approach, assuming there aren't flaws in it beyond what I know. Finding the exactly-right approach for 0.2.5 seems like it might take a bunch of discussion and analysis.

The goal is to drive up the cost of the attack while making sure you are not killing an honest client's circuit. If you consider a function of the queue length and the waiting time of the longest waiting cell for each circuit, then a malicious client will have to read cells at a rate at least as fast as the slowest honest client *on every sybil* in order to cause the relay to target the honest client's circuit. It can be shown that the longer the attacker's queue becomes, the faster it must read from it to avoid selection because it needs to keep the waiting times of the cells in the malicious circuits lower than those of all honest circuits.

You don't want a function of the longest waiting cell for each circuit, I think: you want a function of the expected time to drain the queue.[*] That's why I was looking at connection flush rates -- the speed at which the connection is draining, and the lengths of the other circuit queues on the connection, determine how long it will take to drain the queue.

[*] Why look at the expected drain time and not age of the longest waiting cell? Imagine a slow connection where a bunch of legit circuits have had modestly long queues for a while, and a new connection has gotten a really egregious queue really quickly. The ones that have been around longer seem like they could be the ones to get killed, which isn't so great.
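The footnote's scenario can be made concrete (all numbers invented for illustration): two long-lived circuits on a slow 1-cell/sec connection have modest 50-cell queues with old cells, while a 100-cell/sec connection just acquired a 10000-cell queue. Selecting the victim by oldest-cell age targets the honest slow circuits; selecting by expected drain time (queue length over connection rate) targets the egregious newcomer:

```python
# Victim selection by oldest-cell age vs. by expected drain time.
circuits = [
    # (name, queue_len_cells, conn_bw_cells_per_sec, oldest_cell_age_sec)
    ("legit-slow-1",  50,     1, 120),
    ("legit-slow-2",  50,     1, 110),
    ("egregious-new", 10000, 100,   5),
]
by_age = max(circuits, key=lambda c: c[3])[0]           # oldest cell wins
by_drain = max(circuits, key=lambda c: c[1] / c[2])[0]  # longest drain wins
```

Here `by_age` picks "legit-slow-1" (cell age 120 s) while `by_drain` picks "egregious-new" (10000/100 = 100 s to drain versus 50 s for each slow circuit), which is the behavior the footnote argues for.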

comment:23 in reply to: ↑ 9 Changed 10 months ago by andrea

  • Resolution set to fixed
  • Status changed from needs_review to closed

Replying to nickm:

andrea: feel free to merge those for now.

Merged.

comment:24 Changed 10 months ago by nickm

  • Component changed from Tor to Thandy
  • Milestone changed from Tor: 0.2.3.x-final to Tor: 0.2.4.x-final
  • Priority changed from critical to major
  • Resolution fixed deleted
  • Status changed from closed to reopened

I'm going to suggest we keep this open for now to discuss what fix we want to do in 0.2.3/0.2.4.

comment:25 Changed 10 months ago by nickm

  • Component changed from Thandy to Tor

comment:26 Changed 10 months ago by nickm

I wrote:

I think this here (algorithm 1 in 0.2.3 and 0.2.4; work on better stuff for 0.2.5 for a possible backport) is my current preferred approach, assuming there aren't flaws in it beyond what I know. Finding the exactly-right approach for 0.2.5 seems like it might take a bunch of discussion and analysis.

See #9063 for my attempt at hacking that together. It needs review.

comment:27 Changed 10 months ago by mikeperry

From our IRC friend:

"idea with streams limiting for #9072 is wrong too. idea with limiting half open streams is better but still fragile. no way to limit queue length with current protocol. and don't forget about many relay_truncate, relay_truncated, that can be used actively with leaky pipe topology."

I asked him: what if we made a patch that was a torrc option, so we at least had something in a release we could tell people "Turn this on if you're a guard and experience OOMs"? How should we implement that option? Is there a least bad choice?

limiting streams to limit queue length to protect against killer website but protect against oom is kludge. need to think about torrc option.

comment:28 in reply to: ↑ 14 Changed 10 months ago by andrea

Replying to rransom:

How long are cell queues for normal clients' circuits and connections in practice?

Pretty short; testing the original #9063 patch I had to reduce the maximum to 5 cells to observe circuit kills.

comment:29 in reply to: ↑ 11 ; follow-up: Changed 10 months ago by andrea

Replying to nickm_mobile:

Here are a few sketch oom-killer algorithms. Let's try to find flaws in them and find something better.

Algorithm 1.
1) have a max number of allocated cells configured.
2) when that number is reached, consider all circuits and kill those with the longest queues until we are at X% of the limit.

Making sure I understand this correctly: this is better than the max queue length version because the maximum is controlled by the amount of memory available globally, so the queue fill has to be much, much larger to force a circuit kill needed for the potential exploit?

comment:30 in reply to: ↑ 29 Changed 10 months ago by nickm

Replying to andrea:

Making sure I understand this correctly: this is better than the max queue length version because the maximum is controlled by the amount of memory available globally, so the queue fill has to be much, much larger to force a circuit kill needed for the potential exploit?

Right. With all of the algorithms I outlined above, we only kill circuits once we're really about to be out of memory. So the tricks described in the original ticket here (and in the first comment) aren't sufficient to force a circuit kill.

comment:31 Changed 10 months ago by nickm

  • Milestone changed from Tor: 0.2.4.x-final to Tor: 0.2.3.x-final
  • Resolution set to fixed
  • Status changed from reopened to closed

comment:32 Changed 10 months ago by nickm

(I merged the bug9063_redux_023_squashed branch.)

comment:33 Changed 10 months ago by nickm

The place to figure out the best long-term fix is now #9093.
