Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#9093 closed enhancement (fixed)

Better, fairer circuit OOM handling

Reported by: nickm Owned by:
Priority: High Milestone: Tor: 0.2.4.x-final
Component: Core Tor/Tor Version:
Severity: Keywords: tor-relay 023-backport oom
Cc: robgjansen, Flo, nickm, rpw Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

With our second attempt at a #9063 fix, merged into 0.2.4.14-alpha, I introduced an OOM handler that kills circuits if we're too low on memory. But the obvious algorithm I used ("Algorithm 1" as described on #9072) is not as good as it could be or should be.

Currently, I favor looking for a good estimator of "How long will this circuit take to drain all of its currently queued cells?" and using that for deciding which circuits to kill when low on RAM. There are other suggestions on #9072 too. And to those a suggestion from IRC that we look at the age of the oldest cell on the circuit.

If we find something that uses data that Tor currently captures (for example, with the ewma machinery), it will be much easier to deploy.

For whatever we pick, we need to analyze its security implications and look for ways to game it to provoke relays to do something stupid.

It would be good to have a default OOM threshold computed in some sane (though probably nonportable) way based on available RAM, with a reasonable cap. That might be ridiculously hard though.

It would also be good to see about taking more potentially-big things into account, not just circuit queues.

Child Tickets

Change History (8)

comment:1 Changed 4 years ago by nickm

Keywords: 023-backport added; 024-backport removed
Milestone: Tor: 0.2.5.x-finalTor: 0.2.4.x-final
Status: newneeds_review
Summary: We should have better, fairer OOM handlingBetter, fairer circuit OOM handling

In a forthcoming paper, Jansen, Tschorsch, Johnson, and Scheuermann describe an improved algorithm: kill circuits starting with those whose oldest cell (the one at the front of the queue) was inserted first. At first glance, this seems solid: "cell have been waiting a long time to drain" does seem like a good way to detect stuck or stalled or slow circuits which aren't likely to drain on their own.

See branch "bug9093_023" in my public repository.

comment:2 Changed 4 years ago by nickm

Keywords: oom added

comment:3 Changed 4 years ago by arma

Patch looks fine. The only part that popped out as weird is where circuit_max_queued_cell_age() looks at both the age in the n_conn and also the age in the p_conn.

We've been mostly phrasing the attack as an attack on guards, so in that case only p_conn matters. But I guess we could imagine this attack where you extend your circuit to a colluding relay (maybe even not one in the consensus), and then do the attack in reverse. Sounds good.

The other variant of the attack, where we upload a large file to a colluding webserver which stops reading, in hopes of filling the ram at the exit relay... I guess the exit will stop sending sendmes when the webserver stops accepting bytes. Which means the attack would be a parallel one, where you queue your 500 cells (250KB) and then move on to a new connection to do it again. Is that variant handled by this fix (do most of those cells hang out on an edge outbuf, or in a circuit queue?), or is it a new ticket?

comment:4 in reply to:  3 Changed 4 years ago by nickm

Replying to arma:

Patch looks fine. The only part that popped out as weird is where circuit_max_queued_cell_age() looks at both the age in the n_conn and also the age in the p_conn.

We've been mostly phrasing the attack as an attack on guards, so in that case only p_conn matters. But I guess we could imagine this attack where you extend your circuit to a colluding relay (maybe even not one in the consensus), and then do the attack in reverse. Sounds good.

Yup, that's the idea.

The other variant of the attack, where we upload a large file to a colluding webserver which stops reading, in hopes of filling the ram at the exit relay... I guess the exit will stop sending sendmes when the webserver stops accepting bytes. Which means the attack would be a parallel one, where you queue your 500 cells (250KB) and then move on to a new connection to do it again. Is that variant handled by this fix (do most of those cells hang out on an edge outbuf, or in a circuit queue?), or is it a new ticket?

I think it's a new ticket, where we expand our OOM killer to look at edge connections too.

Do you think this one is good to merge in 0.2.4?

comment:5 Changed 4 years ago by nickm

Merged into 0.2.3 and later. The new ticket is #10169.

comment:6 Changed 4 years ago by arma

Resolution: fixed
Status: needs_reviewclosed

Closing this one as done.

(I pointed out to Nick that this patch does crazy things if your clock moves back in time, since all the ages are unsigned. He said that only stupid people let their clock move back. And then he opened #10168.)

comment:7 in reply to:  6 Changed 4 years ago by nickm

Replying to arma:

He said that only stupid people let their clock move back.

Well, I also said that this is only a problem if you let your clock move back, and THEN you hit OOM.

Also that it's stupid to let a clock move back, and that it's kind of stupid for us not to be doing monotonic clocks.

comment:8 in reply to:  1 Changed 4 years ago by robgjansen

Replying to nickm:

In a forthcoming paper, Jansen, Tschorsch, Johnson, and Scheuermann describe an improved algorithm: kill circuits starting with those whose oldest cell (the one at the front of the queue) was inserted first.

See here:
https://www-users.cs.umn.edu/~jansen/publications/sniper-ndss2014.pdf

Note: See TracTickets for help on using tickets.