#9093 closed enhancement (fixed)
Better, fairer circuit OOM handling
Reported by: | nickm | Owned by: | |
---|---|---|---|
Priority: | High | Milestone: | Tor: 0.2.4.x-final |
Component: | Core Tor/Tor | Version: | |
Severity: | Keywords: | tor-relay 023-backport oom | |
Cc: | robgjansen, Flo, nickm, rpw | Actual Points: | |
Parent ID: | Points: | ||
Reviewer: | Sponsor: |
Description
With our second attempt at a #9063 fix, merged into 0.2.4.14-alpha, I introduced an OOM handler that kills circuits if we're too low on memory. But the obvious algorithm I used ("Algorithm 1" as described on #9072) is not as good as it could be or should be.
Currently, I favor looking for a good estimator of "How long will this circuit take to drain all of its currently queued cells?" and using that for deciding which circuits to kill when low on RAM. There are other suggestions on #9072 too. And to those a suggestion from IRC that we look at the age of the oldest cell on the circuit.
If we find something that uses data that Tor currently captures (for example, with the ewma machinery), it will be much easier to deploy.
For whatever we pick, we need to analyze its security implications and look for ways to game it to provoke relays to do something stupid.
It would be good to have a default OOM threshold computed in some sane (though probably nonportable) way based on available RAM, with a reasonable cap. That might be ridiculously hard though.
It would also be good to see about taking more potentially-big things into account, not just circuit queues.
Child Tickets
Change History (8)
comment:1 follow-up: 8 Changed 5 years ago by
Keywords: | 023-backport added; 024-backport removed |
---|---|
Milestone: | Tor: 0.2.5.x-final → Tor: 0.2.4.x-final |
Status: | new → needs_review |
Summary: | We should have better, fairer OOM handling → Better, fairer circuit OOM handling |
comment:2 Changed 5 years ago by
Keywords: | oom added |
---|
comment:3 follow-up: 4 Changed 5 years ago by
Patch looks fine. The only part that popped out as weird is where circuit_max_queued_cell_age() looks at both the age in the n_conn and also the age in the p_conn.
We've been mostly phrasing the attack as an attack on guards, so in that case only p_conn matters. But I guess we could imagine this attack where you extend your circuit to a colluding relay (maybe even not one in the consensus), and then do the attack in reverse. Sounds good.
The other variant of the attack, where we upload a large file to a colluding webserver which stops reading, in hopes of filling the ram at the exit relay... I guess the exit will stop sending sendmes when the webserver stops accepting bytes. Which means the attack would be a parallel one, where you queue your 500 cells (250KB) and then move on to a new connection to do it again. Is that variant handled by this fix (do most of those cells hang out on an edge outbuf, or in a circuit queue?), or is it a new ticket?
comment:4 Changed 5 years ago by
Replying to arma:
Patch looks fine. The only part that popped out as weird is where circuit_max_queued_cell_age() looks at both the age in the n_conn and also the age in the p_conn.
We've been mostly phrasing the attack as an attack on guards, so in that case only p_conn matters. But I guess we could imagine this attack where you extend your circuit to a colluding relay (maybe even not one in the consensus), and then do the attack in reverse. Sounds good.
Yup, that's the idea.
The other variant of the attack, where we upload a large file to a colluding webserver which stops reading, in hopes of filling the ram at the exit relay... I guess the exit will stop sending sendmes when the webserver stops accepting bytes. Which means the attack would be a parallel one, where you queue your 500 cells (250KB) and then move on to a new connection to do it again. Is that variant handled by this fix (do most of those cells hang out on an edge outbuf, or in a circuit queue?), or is it a new ticket?
I think it's a new ticket, where we expand our OOM killer to look at edge connections too.
Do you think this one is good to merge in 0.2.4?
comment:6 follow-up: 7 Changed 5 years ago by
Resolution: | → fixed |
---|---|
Status: | needs_review → closed |
Closing this one as done.
(I pointed out to Nick that this patch does crazy things if your clock moves back in time, since all the ages are unsigned. He said that only stupid people let their clock move back. And then he opened #10168.)
comment:7 Changed 5 years ago by
Replying to arma:
He said that only stupid people let their clock move back.
Well, I also said that this is only a problem if you let your clock move back, and THEN you hit OOM.
Also that it's stupid to let a clock move back, and that it's kind of stupid for us not to be doing monotonic clocks.
comment:8 Changed 5 years ago by
Replying to nickm:
In a forthcoming paper, Jansen, Tschorsch, Johnson, and Scheuermann describe an improved algorithm: kill circuits starting with those whose oldest cell (the one at the front of the queue) was inserted first.
See here:
https://www-users.cs.umn.edu/~jansen/publications/sniper-ndss2014.pdf
In a forthcoming paper, Jansen, Tschorsch, Johnson, and Scheuermann describe an improved algorithm: kill circuits starting with those whose oldest cell (the one at the front of the queue) was inserted first. At first glance, this seems solid: "cell have been waiting a long time to drain" does seem like a good way to detect stuck or stalled or slow circuits which aren't likely to drain on their own.
See branch "bug9093_023" in my public repository.