Feb 12 18:54:55 tornode2 Tor[6362]: We're low on memory (cell queues total alloc: 1602579792 buffer total alloc: 1388544, tor compress total alloc: 1586784 rendezvous cache total alloc: 489909). Killing circuits with over-long queues. (This behavior is controlled by MaxMemInQueues.)
Feb 12 18:54:56 tornode2 Tor[6362]: Removed 1599323088 bytes by killing 1 circuits; 39546 circuits remain alive. Also killed 0 non-linked directory connections.
Notice the ~1GB of cells for one single circuit? Somehow, there is an issue in tor that makes it possible to fill up the circuit cell queue while the scheduler is just not emptying that queue.
This is not trivial so I'll try to explain what I can see:
In append_cell_to_circuit_queue(), there is a check on the cell queue for a maximum size, which makes the circuit stop reading on the connection if it is reached:
{{{
  /* If we have too many cells on the circuit, we should stop reading from
   * the edge streams for a while. */
  if (!streams_blocked && queue->n >= CELL_QUEUE_HIGHWATER_SIZE)
    set_streams_blocked_on_circ(circ, chan, 1, 0); /* block streams */
}}}
In set_streams_blocked_on_circ(), the very first thing it does is this undocumented, non-trivial if:
{{{
  if (circ->n_chan == chan) {
    circ->streams_blocked_on_n_chan = block;
    if (CIRCUIT_IS_ORIGIN(circ))
      edge = TO_ORIGIN_CIRCUIT(circ)->p_streams;
  } else {
    circ->streams_blocked_on_p_chan = block;
    tor_assert(!CIRCUIT_IS_ORIGIN(circ));
    edge = TO_OR_CIRCUIT(circ)->n_streams;
  }
  [stop reading on all the "edge" streams]
}}}
Let's use the example where we are a Guard and we have an or_circuit_t with the "p_chan" being the client connection and the "n_chan" being the connection to the middle node.
If we set the block on n_chan, we would only stop the read() if the circuit is origin because p_streams is only set on a tor client.
Does this mean that if we try to deliver a cell forward (towards n_chan), even though we've reached the high limit, we'll still queue it, because there is never a point where we check whether we are blocked on "n_chan" at the relay level?
Per a discussion with armadev on IRC, this situation is unfortunately possible and the only defense we have in place is the OOM handler.
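To make that asymmetry concrete, here is a tiny standalone model (illustrative only, not tor code) of what the blocking logic amounts to at a relay: the flag gets set, but there are no edge streams to stop reading from, so nothing slows the sender down and the queue keeps growing.

{{{
/* Toy model of the stream-blocking asymmetry; compiles with any C compiler. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
  bool is_origin;       /* client-side (origin) circuit? */
  int n_p_streams;      /* origin circuits only (client edge streams) */
  int n_exit_streams;   /* or_circuit_t n_streams, only set at exits */
} toy_circ_t;

/* Mirrors the if/else quoted above: returns how many edge streams we
 * would actually stop reading from. */
static int
toy_block_streams(const toy_circ_t *c, bool towards_n_chan)
{
  if (towards_n_chan)
    return c->is_origin ? c->n_p_streams : 0; /* guard/middle: nothing */
  else
    return c->n_exit_streams;                 /* only exits have these */
}

int
main(void)
{
  toy_circ_t guard_circ = { false, 0, 0 };    /* relay circuit, no edges */
  printf("edge streams blocked at a guard, towards n_chan: %d\n",
         toy_block_streams(&guard_circ, true));
  return 0;
}
}}}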
We should think about having an upper limit on any given circuit queue; reaching it would indicate that we've queued more than we ever should have if the tor client were acting normally.
Just an FYI, the frequency seems to be increasing, which I think has a negative effect on the entire network: in the time leading up to this, CPU usage doubles (possibly more, but it hits the VM's maximum), the node becomes sluggish, and so does network performance for this host.
So possibly it's just the wording, but the status and previous comment seem to indicate this won't be fixed? Should the operator then tweak MaxMemInQueues by the number of expected circuits times X KiB to ensure this doesn't have such an impact?
Feb 12 18:54:56 tornode2 Tor[6362]: Removed 1599323088 bytes by killing 1 circuits; 39546 circuits remain alive. Also killed 0 non-linked directory connections.
Feb 14 18:46:29 tornode2 Tor[13637]: Removed 519278496 bytes by killing 1 circuits; 122786 circuits remain alive. Also killed 0 non-linked directory connections.
Feb 14 19:04:21 tornode2 Tor[13637]: Removed 785149728 bytes by killing 1 circuits; 111745 circuits remain alive. Also killed 0 non-linked directory connections.
Feb 14 19:46:44 tornode2 Tor[13637]: Removed 1510173456 bytes by killing 1 circuits; 98098 circuits remain alive. Also killed 0 non-linked directory connections.
Feb 15 14:14:18 tornode1 Tor[23044]: Removed 1360147536 bytes by killing 1 circuits; 58922 circuits remain alive. Also killed 0 non-linked directory connections.
Feb 16 01:58:55 tornode1 Tor[23044]: Removed 509234880 bytes by killing 4 circuits; 57780 circuits remain alive. Also killed 0 non-linked directory connections.
Feb 16 02:07:22 tornode1 Tor[23044]: Removed 699879840 bytes by killing 1 circuits; 53007 circuits remain alive. Also killed 0 non-linked directory connections.
Feb 16 03:05:34 tornode1 Tor[23044]: Removed 1593724176 bytes by killing 2 circuits; 52509 circuits remain alive. Also killed 0 non-linked directory connections.
An upper limit on the circuit queue bound with the SENDME logic would look like this: bug25226_033_01
The gist is that if the queue size goes above the circuit window start maximum limit (1000), the circuit is closed with the TORPROTOCOL reason. If we ever reach that limit, it means something is wrong in the path and the edge connection keeps sending stuff even though it shouldn't.
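Roughly, the idea is this (standalone sketch only, with illustrative names and numbers; the real change is in the bug25226_033_01 branch): once a circuit's queue reaches the cap, refuse the cell and close the circuit instead of letting the queue grow until the OOM handler fires.

{{{
/* Standalone sketch of a hard per-circuit cell queue cap (not the patch). */
#include <stdbool.h>
#include <stdio.h>

#define CIRCWINDOW_START_MAX 1000   /* tor's maximum circuit window */

typedef struct {
  int n;                    /* cells currently queued, like cell_queue_t.n */
  bool marked_for_close;    /* stand-in for circuit_mark_for_close() */
} toy_circuit_t;

static bool
toy_append_cell(toy_circuit_t *circ, int max_queue)
{
  if (circ->n >= max_queue) {
    circ->marked_for_close = true;  /* would use END_CIRC_REASON_TORPROTOCOL */
    return false;
  }
  circ->n++;
  return true;
}

int
main(void)
{
  toy_circuit_t circ = { 0, false };
  int queued = 0;
  while (toy_append_cell(&circ, CIRCWINDOW_START_MAX))
    queued++;
  printf("queued %d cells, then killed the circuit\n", queued);
  return 0;
}
}}}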
Note from IRC: I like the idea here as a band-aid. I think that the maximum should be configured via a consensus param, though, and default to something a bit higher than 1000?
Also I really want to know what Roger thinks of that idea.
In the long term, we're not going to get fixes for the performance issues caused by being low on memory until we can better identify memory pressure: either by setting MaxMemInQueues more aggressively, working with the OS/allocator somehow, asking the user to configure MaxMemInQueues to a lower value, or something like that.
An upper limit on the circuit queue bound with the SENDME logic would look like this: bug25226_033_01
The gist is that if the queue size goes above the circuit window start maximum limit (1000), the circuit is closed with the TORPROTOCOL reason. If we ever reach that limit, it means something is wrong in the path and the edge connection keeps sending stuff even though it shouldn't.
tor-spec.txt:
To control a circuit's bandwidth usage, each OR keeps track of two 'windows', consisting of how many RELAY_DATA cells it is allowed to originate (package for transmission), and how many RELAY_DATA cells it is willing to consume (receive for local streams). These limits do not apply to cells that the OR receives from one host and relays to another.
tor-design.pdf:
Leaky-pipe circuit topology: Through in-band signaling within the circuit, Tor initiators can direct traffic to nodes partway down the circuit. This novel approach allows traffic to exit the circuit from the middle — possibly frustrating traffic shape and volume attacks based on observing the end of the circuit. (It also allows for long-range padding if future research shows this to be worthwhile.)
An upper limit on the circuit queue bound with the SENDME logic would look like this: bug25226_033_01
The gist is that if the queue size goes above the circuit window start maximum limit (1000), the circuit is closed with the TORPROTOCOL reason. If we ever reach that limit, it means something is wrong in the path and the edge connection keeps sending stuff even though it shouldn't.
tor-spec.txt:
To control a circuit's bandwidth usage, each OR keeps track of two 'windows', consisting of how many RELAY_DATA cells it is allowed to originate (package for transmission), and how many RELAY_DATA cells it is willing to consume (receive for local streams). These limits do not apply to cells that the OR receives from one host and relays to another.
We can change this part of the spec.
tor-design.pdf:
Leaky-pipe circuit topology: Through in-band signaling within the circuit, Tor initiators can direct traffic to nodes partway down the circuit. This novel approach allows traffic to exit the circuit from the middle — possibly frustrating traffic shape and volume attacks based on observing the end of the circuit. (It also allows for long-range padding if future research shows this to be worthwhile.)
Standard Tor clients do not use this feature, but they get close:
connection netflow padding is at link level, not circuit level
client introduce circuits send a single INTRODUCE1 cell, then EXTEND if the introduction fails
This has been discussed at the Rome meeting with arma and Robjansen.
The gist is that tor does keep a rolling count of circuit-level SENDMEs at each hop of the circuit, because the tor protocol allows the client to exit at any hop of the circuit.
However, under normal circumstances it is end-to-end from client to Exit. Considering the top limit of 8 hops per circuit, the circuit cell queue should be at most 8000 cells (circuit window start * 8 hops).
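A quick back-of-the-envelope check of that bound, assuming 1000-cell circuit windows, at most 8 hops, and 514-byte (CELL_MAX_NETWORK_SIZE) cells on the wire:

{{{
/* Rough upper bound on a well-behaved circuit's queue (illustrative). */
#include <stdio.h>

int
main(void)
{
  const int circwindow_start = 1000;  /* CIRCWINDOW_START */
  const int max_hops = 8;
  const double cell_bytes = 514.0;    /* CELL_MAX_NETWORK_SIZE */
  int max_cells = circwindow_start * max_hops;
  printf("%d cells max, ~%.1f MB\n", max_cells,
         max_cells * cell_bytes / (1024 * 1024));
  return 0;
}
}}}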
This is not trivial so I'll try to explain what I can see:
In append_cell_to_circuit_queue(), there is a check on the cell queue for a maximum size, which makes the circuit stop reading on the connection if it is reached:
Careful here! You're right, but by "connection" you mean edge connection, or stream. Not TLS ("OR") connection.
Let's use the example where we are a Guard and we have an or_circuit_t with the "p_chan" being the client connection and the "n_chan" being the connection to the middle node.
If we set the block on n_chan, we would only stop the read() if the circuit is origin because p_streams is only set on a tor client.
Correct. That whole set_streams_blocked_on_circ() business is for streams, i.e. edge connections. Specifically, origin streams when we're a client, or exit streams when we're an exit.
Does this mean that if we try to deliver a cell forward (towards n_chan), even though we've reached the high limit, we'll still queue it, because there is never a point where we check whether we are blocked on "n_chan" at the relay level?
This old ticket is really important here, because there was an earlier ticket (#9063) that proposed to limit the number of cells in-flight on a circuit, and #9072 is arguing that it opens up a bad guard discovery attack.
Mind you, when I opened #9072, it was simply about how our protocol actually allows a huge number of cells in-flight, because non-data cells don't count in the sendme windows, so there is no easy small number where if a circuit goes over you know it's violating protocol.
Mike was the one who retitled my ticket to be about guard discovery attacks. He says:
The attack enabled by #9063 is extremely similar to the Guard discovery attack from rpw's paper.
I wonder what the guard discovery attack actually is? Nobody says what it is on that ticket.
So it would seem smart to reconstruct what it was we were worried about, to figure out if we should still be worried.
I am bringing Mikeperry back into this ticket so he can channel his five-year-earlier self and tell us what the danger is.
our protocol actually allows a huge number of cells in-flight, because non-data cells don't count in the sendme windows
One of the proposed longer term fixes here is to make all kinds of relay cells count in the sendme windows. Then you wouldn't have these weird side channel issues.
The first problem to solve if we want to do that is the deadlock question: if both sides have exhausted their package window, then nobody can acknowledge anything, and they're stuck. This one could be solved by making only sendme cells be an exception to sendme windows, which still lets us keep to a quite limited number of in-flight cells.
After that one, there might be other issues, like "I can't send these end cells until I've gotten a sendme from you", or "I can't send this begin cell until I've gotten a sendme", which aren't the end of the world but might produce some surprising usability issues.
It's worth exploring more in case we can convince ourselves the benefits outweigh the costs!
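As a toy illustration of the deadlock concern (not tor code; numbers and structure simplified): if SENDME cells themselves counted against the package window, two endpoints that both hit zero could never replenish each other, while exempting just the SENDMEs breaks the cycle.

{{{
/* Toy model: can an endpoint with an empty package window still ack? */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int package_window; } endpoint_t;

static bool
try_send_circuit_sendme(endpoint_t *self, endpoint_t *peer, bool sendme_exempt)
{
  if (!sendme_exempt) {
    if (self->package_window <= 0)
      return false;               /* can't even send the ack: deadlock */
    self->package_window--;
  }
  peer->package_window += 100;    /* a circuit SENDME refills 100 cells */
  return true;
}

int
main(void)
{
  endpoint_t a = { 0 }, b = { 0 };  /* both windows exhausted */
  printf("SENDMEs counted: %s\n",
         try_send_circuit_sendme(&a, &b, false) ? "ack sent" : "deadlock");
  printf("SENDMEs exempt:  %s\n",
         try_send_circuit_sendme(&a, &b, true) ? "ack sent" : "deadlock");
  return 0;
}
}}}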
This old ticket is really important here, because there was an earlier ticket (#9063) that proposed to limit the number of cells in-flight on a circuit, and #9072 is arguing that it opens up a bad guard discovery attack.
Mind you, when I opened #9072, it was simply about how our protocol actually allows a huge number of cells in-flight, because non-data cells don't count in the sendme windows, so there is no easy small number where if a circuit goes over you know it's violating protocol.
Mike was the one who retitled my ticket to be about guard discovery attacks. He says:
{{{
The attack enabled by #9063 is extremely similar to the Guard discovery attack from rpw's paper.
}}}
I wonder what the guard discovery attack actually is? Nobody says what it is on that ticket.
So it would seem smart to reconstruct what it was we were worried about, to figure out if we should still be worried.
I am bringing Mikeperry back into this ticket so he can channel his five-year-earlier self and tell us what the danger is.
The attack in #9072 was due to allowing a small, fixed number of in-flight stream creates to kill the circuit. If this number is small and predictable, a malicious website can create that many streams to cause the circuit to fail, at which point Tor will rebuild a new one on a new path. The website gets to rinse and repeat, creating circuits until one of its middles is chosen next to the guard.
I think we still need to figure out possibly a better default value or at the very least a consensus parameter that makes sense.
Big picture review: I think we should proceed with doing this feature, even though we can't really pick a low threshold yet.
I see three benefits for putting this feature in:
We should pick a really high threshold for the consensus, like 50000 cells or 100000 cells, which is essentially at the "oom attempt" level, and now we're killing circuits when they overload us a lot, without needing to wait until we're actually running out of memory, and without needing to have our reaction be a function of how much memory the relay has.
I was originally going to say "I don't think there's any number where we should set this in the consensus right now on the main Tor network," but I think at the 50k or 100k cell mark, even if somebody is following the protocol, we could still kill the circuit "because fairness".
If things go to shit in the future and people start doing bad things to the network that we're not expecting right now, then this would be another available tool for letting relays defend themselves. Shipping it out now will mean it's in place if we decide we need it.
The test networks, where they know the client and website traffic behaviors, can set it to a much lower value, and use it for debugging when they hit the threshold.
For that last one, there are really two things we want to understand here. First, what are the limits on acceptable behavior by "honest" users? That is, what is the threshold above which we say "no honest user would attempt that". And second, are there bugs or surprises in our current design that cause us to hit a higher threshold than we meant to? And it's that second one that a good network testing harness, plus this ticket, can discover.
To summarize arma: we should get this in, keep the current default in the code as is, but use a much larger one network-wide (consensus). But also, you seem to be suggesting that it would be useful for relay operators to maybe control that value somehow, which means it also needs to become a possible torrc option? (The patch doesn't have that.)
50k cells is ~25MB, so very little; 100k cells would bring that to ~50MB. I've seen 8001 cells only once on my fast relay in the last 7 days, so 50k cells is more than enough.
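For reference, the arithmetic behind those numbers (assuming 514-byte cells and ignoring per-cell queue overhead):

{{{
/* Cells-to-memory back-of-envelope (illustrative). */
#include <stdio.h>

int
main(void)
{
  const double cell_bytes = 514.0;      /* CELL_MAX_NETWORK_SIZE */
  const double mib = 1024.0 * 1024.0;
  const int limits[] = { 8001, 50000, 100000 };
  for (int i = 0; i < 3; i++)
    printf("%6d cells ~= %5.1f MB\n", limits[i],
           limits[i] * cell_bytes / mib);
  printf("circuits of 50000 cells that fit in 1 GB: ~%.0f\n",
         1024.0 / (50000 * cell_bytes / mib));
  return 0;
}
}}}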
But also, you seem to be suggesting that it would be useful for relay operators to maybe control that value somehow, which means it also needs to become a possible torrc option? (The patch doesn't have that.)
Ah, no, I think the way we'd let relays defend themselves is just by turning the knob down on the network-wide consensus value, so relays automatically defend themselves more aggressively.
{{{
+  if (PREDICT_UNLIKELY(queue->n > max_circuit_cell_queue_size)) {
}}}
This wants to be >=, right? Otherwise we let it grow one past the maximum.
and then in the log message, s/cell/cells/.
The default if we don't have a consensus param is 9000? I'm a bit nervous about edge cases where we somehow fail to set a consensus param, or when relays act before they know a consensus param, especially in the future once we've rolled out Mike's "defensive dropping" website fingerprinting design. What would you say to having a default of 50000? Since that's what we're imagining to set as the default for now anyway.
Also, the minimum settable number is 8000? It seems like we want to allow a lower number for chutney experiments where we control all of the traffic and we're trying to check for bugs. I think 1000 might be a little bit too low, for example because if you ask for two streams and they finish, you'll get 500+500+1+1=1002 cells coming at you from the exit. Or if you're using optimistic data, you'll get the CONNECTED cells back first, bumping it up to more like 1004. But still, I think setting the minimum to 1000 is a legitimate choice, since that makes it easy for experimenters to experiment however they see fit. We should just avoid setting the minimum possible value in the consensus, but that's already true for other consensus params, e.g. we should avoid setting DOS_CC_CIRCUIT_BURST_DEFAULT to 1 even though it's allowed by the code.
And, now that we understand more what counts in the sendme window and what doesn't, it might be smart to revise the huge comment to be more accurate.
{{{
  if (PREDICT_UNLIKELY(queue->n > max_circuit_cell_queue_size)) {
}}}
This wants to be >=, right? Otherwise we let it grow one past the maximum.
and then in the log message, s/cell/cells/.
Fixup commit ed963d7604
The default if we don't have a consensus param is 9000? I'm a bit nervous about edge cases where we somehow fail to set a consensus param, or when relays act before they know a consensus param, especially in the future once we've rolled out Mike's "defensive dropping" website fingerprinting design. What would you say to having a default of 50000? Since that's what we're imagining to set as the default for now anyway.
I'm fine with it. 50k cells is ~25MB, so ~40 circuits like that would start putting memory pressure at 1GB. Not many circuits, but even at the 9k limit the number of circuits would be a bit less than 250, which is not much either.
Fixup commit 7d25850415
Also, the minimum settable number is 8000? It seems like we want to allow a lower number for chutney experiments where we control all of the traffic and we're trying to check for bugs. I think 1000 might be a little bit too low, for example because if you ask for two streams and they finish, you'll get 500+500+1+1=1002 cells coming at you from the exit. Or if you're using optimistic data, you'll get the CONNECTED cells back first, bumping it up to more like 1004. But still, I think setting the minimum to 1000 is a legitimate choice, since that makes it easy for experimenters to experiment however they see fit. We should just avoid setting the minimum possible value in the consensus, but that's already true for other consensus params, e.g. we should avoid setting DOS_CC_CIRCUIT_BURST_DEFAULT to 1 even though it's allowed by the code.
Fixup commit 497954459d
And, now that we understand more what counts in the sendme window and what doesn't, it might be smart to revise the huge comment to be more accurate.
Fixup commit d4ee02a714
And, a changes file. :)
Fixup commit f87c2ff62a
All the above have been pushed in my _01 branch. I have a _02 branch that squashes everything but also changes one commit message to reflect the changes made above ^.
{{{
/* Default value is set to a large value so we can handle padding cells
 * properly which aren't accounted for in the SENDME window. Default is 50000
 * allowed cells in the queue resulting in ~25MB. */
#define RELAY_CIRC_CELL_QUEUE_SIZE_DEFAULT \
  (50 * RELAY_CIRC_CELL_QUEUE_SIZE_MIN)
}}}
So until the relay gets (or has) a consensus (e.g. because it just restarted and it's trying to bootstrap but somebody starts using it anyway), any circuits get closed if they get a cell?
Maybe we should initialize it to RELAY_CIRC_CELL_QUEUE_SIZE_DEFAULT at the very start?
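Something like this, maybe (sketch only; the helper name and the consensus-param plumbing here are illustrative, not the actual diff): initialize the static to the compile-time default so a relay that hasn't seen a consensus yet still uses a sane limit, then let the consensus parameter override it.

{{{
/* Illustrative sketch; assumes tor's networkstatus_get_param() and the
 * RELAY_CIRC_CELL_QUEUE_SIZE_* defines from the patch. */
static uint32_t max_circuit_cell_queue_size =
  RELAY_CIRC_CELL_QUEUE_SIZE_DEFAULT;

static void
update_max_circuit_cell_queue_size(const networkstatus_t *ns)
{
  max_circuit_cell_queue_size =
    networkstatus_get_param(ns, "circ_max_cell_queue_size",
                            RELAY_CIRC_CELL_QUEUE_SIZE_DEFAULT,
                            RELAY_CIRC_CELL_QUEUE_SIZE_MIN, INT32_MAX);
}
}}}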