r9913 took out the "clear cell queue after sending destroy cell" logic.
Bug 1150 (comment 3519) points out that the cells are just going to get
dropped by the other side anyway.
I think it's right.
There are three places that call connection_or_send_destroy().
The first is in relay.c, where we unlink the circ from the n_conn
immediately after.
The second is in command.c, where we refuse new circuits for various
reasons. No cells queued yet.
The third is in circuitlist.c, during circuit-mark-for-close. So the
question is, if the circuit is marked, do we really want to still be
trying to send its cells? Either it got marked from the other side,
in which case the other side has already decided it doesn't want to hear
from us. Or it got marked from our side, in which case either that was
an error (circuit needs to die) or it has no streams on it, in which
case there's no harm in clearing its queue.
Did I miss anything?
Perhaps we should check, in main.c when we're considering whether to mark
it because it's idle, whether there are any cells still in its queue.
Or maybe we should just pass an argument into connection_or_send_destroy()
from the truncate case in relay.c saying "clear the queue please". Then we
can remain ambiguous in the cases above where I said "no harm because surely
the queue has nothing in it". That's certainly the safest approach, but
wouldn't it be nice to simplify rather than complexify? :)
[Automatically added by flyspray2trac: Operating System: All]
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
So if the destroy cell is outgoing, the client really won't process any cells on the circuit, so it's safe for the client to clear the queue. And if it's incoming at a relay, it's going to turn into a truncated, and get queued up behind all the other relay cells for the client. And if it's incoming at a client, there is currently no cell queue there at all, since we crypt cells aggressively on receiving them. So I am pretty sure that clearing cell queues on circuit_mark_for_close is a safe thing to do. I don't think we were sending any cells in these cases, but if we were, they weren't getting processed.
The one case that puzzles me is RELAY_COMMAND_TRUNCATE. We call connection_or_send_destroy() there. But do we really mean to cancel every relay cell that the intermediate relay received before the TRUNCATE command but has not yet relayed, or do we mean them to go on their way? I think it's safe to cancel them given how Tor works now, but our spec doesn't envision that possibility.
Perhaps we should check, in main.c when we're considering whether to mark it because it's idle, whether there are any cells still in its queue.
You were talking about connections, perhaps? I don't see anywhere in main.c where we close idle circuits. But connections don't have cell queues; circuits do.
Anyway, I've hacked up a fix. See branch bug1184 in my public repository.
Trac: Status: new to needs_review Description: r9913 took out the "clear cell queue after sending destroy cell" logic.
Bug 1150 (comment 3519) points out that the cells are just going to get
dropped by the other side anyway.
I think it's right.
There are three places that call connection_or_send_destroy().
The first is in relay.c, where we unlink the circ from the n_conn
immediately after.
The second is in command.c, where we refuse new circuits for various
reasons. No cells queued yet.
The third is in circuitlist.c, during circuit-mark-for-close. So the
question is, if the circuit is marked, do we really want to still be
trying to send its cells? Either it got marked from the other side,
in which case the other side has already decided it doesn't want to hear
from us. Or it got marked from our side, in which case either that was
an error (circuit needs to die) or it has no streams on it, in which
case there's no harm in clearing its queue.
Did I miss anything?
Perhaps we should check, in main.c when we're considering whether to mark
it because it's idle, whether there are any cells still in its queue.
Or maybe we should just pass an argument into connection_or_send_destroy()
from the truncate case in relay.c saying "clear the queue please". Then we
can remain ambiguous in the cases above where I said "no harm because surely
the queue has nothing in it". That's certainly the safest approach, but
wouldn't it be nice to simplify rather than complexify? :)
[Automatically added by flyspray2trac: Operating System: All]
to
r9913 took out the "clear cell queue after sending destroy cell" logic.
Bug 1150 (comment 3519) points out that the cells are just going to get
dropped by the other side anyway.
I think it's right.
There are three places that call connection_or_send_destroy().
The first is in relay.c, where we unlink the circ from the n_conn
immediately after.
The second is in command.c, where we refuse new circuits for various
reasons. No cells queued yet.
The third is in circuitlist.c, during circuit-mark-for-close. So the
question is, if the circuit is marked, do we really want to still be
trying to send its cells? Either it got marked from the other side,
in which case the other side has already decided it doesn't want to hear
from us. Or it got marked from our side, in which case either that was
an error (circuit needs to die) or it has no streams on it, in which
case there's no harm in clearing its queue.
Did I miss anything?
Perhaps we should check, in main.c when we're considering whether to mark
it because it's idle, whether there are any cells still in its queue.
Or maybe we should just pass an argument into connection_or_send_destroy()
from the truncate case in relay.c saying "clear the queue please". Then we
can remain ambiguous in the cases above where I said "no harm because surely
the queue has nothing in it". That's certainly the safest approach, but
wouldn't it be nice to simplify rather than complexify? :)
[Automatically added by flyspray2trac: Operating System: All]
wanoskarnet pointed out on IRC that the behavior on TRUNCATE cells makes the network fragile if anybody uses TRUNCATE. Unfortunately, this isn't new behavior with this patch: the network has operated this way since cell queues were introduced. To make TRUNCATE work right will be another (orthogonal) bugfix. Since nobody's using TRUNCATE after streams now AFAICT, the fix can probably wait till 0.2.3.x.
I've edited the spec in branch bug1184 more to clarify that this isn't working-as-intended.
The two patches leave me confused. The spec edit should probably be squashed into the earlier commit and reworded, and now I wonder why we want to patch this at all until 1784 is resolved. What are we actually fixing with this patch?
See the commit message on the first of the patches:
At best, this patch helps us avoid sending queued relayed cells thatwould get ignored during the time between when a destroy cell issent and when the circuit is finally freed. At worst, it lets usrelease some memory a little earlier than it would otherwise.
Right, I read that. What I mean is, does this actually buy us anything useful in the current network? I was thinking that probably we would want to fix 1784 which is likely to change our behavior here again?
Well, for all cases except the TRUNCATE case, it (might) improve conformance with the spec, which frowns on sending any cells once we've gotten a destroy. Even after fixing 1784, we won't change anything in the non-TRUNCATE cases.
In the TRUNCATE case, the patch (maybe) trades a severely nonconformant broken behavior for a less visibly nonconformant broken behavior. Currently, on a TRUNCATE, we send a DESTROY, maybe send a couple of queued cells after the destroy (bug 1184, fixed here), and drop queued cells when we close the circuit (bug 1784). This patch fixes the "We might send cells on a destroyed circuit" problem, and leaves the "TRUNCATE destroys the circuit too early" problem for later.
Ok thanks! Makes sense now. I've pushed a patch to bug1184 in my repo that changes the tor-spec wording to make more sense to me. Some native english speaker please review. We probably want to squash the three commits into one before merging.
+ [Note: If an OR receives a TRUNCATE cell and it has any RELAY cells+ still queued on the circuit for the next node that it had not yet sent,
s/ that it had not yet sent//
+ it will drop them without sending them. This is not considered+ conformant behavior, but it probably won't get fixed till a later
s/till/until/ (unless we all want to switch over to older English? :)
+ version of Tor. Thus, clients SHOULD NOT send a TRUNCATE cell to a+ node running any current version of Tor if they have sent relay cells+ through that node, and they aren't sure whether those cells have been+ sent on.]
...if a) they have sent relay cells through that node, and b) they aren't sure whether those cells have been sent on yet.
Geez. I just tried to go answer the question "is it the case that when we try to move a cell from circ's cell queue onto the outbuf, we notice that circ->n_conn is null and choose not to send the cell", which moved to "where the heck do we move cells from the circ cell queue onto the outbuf other than connection_or_flush_from_first_active_circuit() which seems to only move one cell and then stop", which moved to "are there any cases where an inactive circuit moves cells to the outbuf" (I think no), which moved to "at what point in closing a circuit do we clear its cell queue" (I guess when we're freeing it and not before).
Then I noticed in connection_edge_process_relay_cell() that we call
circuit_set_n_circid_orconn(circ, 0, NULL);
after the destroy cell (sounds good), but now I'm wondering why we don't call that from the two cases in _circuit_mark_for_close() that send destroy cells. Is it because in those two cases the circuit is going to get marked, which is just as good as setting its n_circid and n_conn to 0,null, for some reason I haven't found yet?
"is it the case that when we try to move a cell from circ's cell queue onto the outbuf, we notice that circ->n_conn is null and choose not to send the cell"
See below....
"where the heck do we move cells from the circ cell queue onto the outbuf other than connection_or_flush_from_first_active_circuit() which seems to only move one cell and then stop"
That's the only place grepping for uses of cell_queue_pop, which is the only place that pulls a cell of a cell queue. The function doesn't "only move one cell and then stop", though. The "for (n_flushed = 0; n_flushed < max && queue->head; ) {" loop in the middle sends more than one cell. It stops when "max" cells have been flushed to the or_conn's outbuf, or when there are no more active circuits (that is, ones with cells to flush).
So, to answer the first question above: the process of moving cells to an orconn outbuf is entirely driven by the orconn looking through its list of active circuits, and not by doing anything to the circuit, it's not possible that a null n_conn on the circ will get followed.
"are there any cases where an inactive circuit moves cells to the outbuf" (I think no)
I agree.
"at what point in closing a circuit do we clear its cell queue" (I guess when we're freeing it and not before).
Right. cell_queue_clear() is called in circuit_free and nowhere else. [This branch changes it so that it is now also called by circuit_clear_cell_queue.]
but now I'm wondering why we don't call that from the two cases in _circuit_mark_for_close() that send destroy cells.
I guess we could unlink the circuit from p_conn/n_conn as appropriate, but I am not sure we need to. What would the rationale be? When we call circuit_clear_cell_queue(), the circuit becomes inactive. It can't get any more cells added to it, because it's marked, so it can't become active, so it can't flush cells.
/me checks whether "It can't get any more cells added to it, because it's marked" is true.
Processing.
Processing.
Hm. This is some complicated stuff. It might be a good idea to add a check to append_cell_to_circuit_queue() so that we drop cells that anybody tries to queue on a marked circuit.
Also, I can't see a reason not to unlink the circuit just after sending the destroy. What Could Go Wrong?
"where the heck do we move cells from the circ cell queue onto the outbuf other than connection_or_flush_from_first_active_circuit() which seems to only move one cell and then stop"
That's the only place grepping for uses of cell_queue_pop, which is the only place that pulls a cell of a cell queue. The function doesn't "only move one cell and then stop", though. The "for (n_flushed = 0; n_flushed < max && queue->head; ) {" loop in the middle sends more than one cell. It stops when "max" cells have been flushed to the or_conn's outbuf, or when there are no more active circuits (that is, ones with cells to flush).
Ah ha. We call connection_or_flush_from_first_active_circuit() multiple times from connection_or_flushed_some() though. That's how we fill it up. Great.
More generally, are we way off the mark now for "important fix that has to get into 0.2.2.x"?
It seems these are all of the form "I bet it wouldn't break anything if we added a check to confirm foo, but as far as I can tell from this code, foo is already the case".
I don't know what you're referring to by "now" and "these". Do you mean bug 1184 as a whole, or the stuff starting with your "Geez" comment ?
If the latter, the one thing I can't prove from the code is actually the case is that cells should never get queued on a marked circuit. If you agree, then we should add a check/fix for that.
I feel like saying "trap" here -- we keep adding defensive programming lines because we're not certain what is actually going on, and in the future we'll never remember which ones we thought were redundant so we'll never be able to take any of them out.
I guess that means we should schedule this area as one of the prime needs for refactoring, in that glorious future when we have time and energy for refactoring.
I rather suspect that this check is necessary: append_cell_to_circuit_queue() is called by:
circuit_receive_relay_cell, which checks for marked_for_close.
circuit_package_relay_cell, which does not check, and is called by:
relay_send_command_from_edge, which does not check, and is called by:
'''Roughly 20 different places throughout the code'''
circuit_deliver_create_cell, which does not check, and is called by:
circuit_n_conn_done, which does check.
circuit_extend, which does not check, and is called by:
connection_edge_process_relay_cell, which is called by:
circuit_receive_relay_cell, which checks.
onionskin_answer, which does not check, and is called by:
command_process_create_cell(), which does not need to check since it creates the circuit in question.
connection_cpu_process_inbuf(), which checks implicitly by retrieving the circuit using circuit_get_by_circid_orconn().
So modulo the 20 callers of relay_send_command_from_edge, then yeah, I am reasonably certain that we don't need to add the check. But becoming reasonably certain about this is hard and time-consuming, and that itself suggests a problem in this part of our code: if I want to make sure that some X never happens if Y is true, I shouldn't have to trace all parts of the call graph that lead to X to make sure that Y is checked along each one. Putting a check at the position right before X is done is a better way to ensure that X never happens if Y than instituting a code-wide rule of "When you call the thing that does X, make sure you have checked Y, or every path that leads to you has checked Y."
But with the change in 87f18c95783, I'm never going to have to waste my time doing this analysis again. And isn't that nice?