Sending useless relay cells after the destroy cell

changed milestone to %Tor: 0.2.2.x-final

added component::core tor/tor milestone::Tor: 0.2.2.x-final priority::low resolution::fixed status::closed type::defect version::0.2.2.6-alpha labels

Trac:
Milestone: 0.2.2.x-final to Tor: 0.2.2.x-final
Parent: N/A to N/A
Keywords: N/A deleted, N/A added

So if the destroy cell is outgoing, the client really won't process any cells on the circuit, so it's safe for the client to clear the queue. And if it's incoming at a relay, it's going to turn into a truncated, and get queued up behind all the other relay cells for the client. And if it's incoming at a client, there is currently no cell queue there at all, since we crypt cells aggressively on receiving them. So I am pretty sure that clearing cell queues on circuit_mark_for_close is a safe thing to do. I don't think we were sending any cells in these cases, but if we were, they weren't getting processed.

The one case that puzzles me is RELAY_COMMAND_TRUNCATE. We call connection_or_send_destroy() there. But do we really mean to cancel every relay cell that the intermediate relay received before the TRUNCATE command but has not yet relayed, or do we mean them to go on their way? I think it's safe to cancel them given how Tor works now, but our spec doesn't envision that possibility.

Perhaps we should check, in main.c when we're considering whether to mark it because it's idle, whether there are any cells still in its queue.

You were talking about connections, perhaps? I don't see anywhere in main.c where we close idle circuits. But connections don't have cell queues; circuits do.

Anyway, I've hacked up a fix. See branch bug1184 in my public repository.

Trac:
Status: new to needs_review
Description: r9913 took out the "clear cell queue after sending destroy cell" logic.

Bug 1150 (comment 3519) points out that the cells are just going to get dropped by the other side anyway.

I think it's right.

There are three places that call connection_or_send_destroy().

The first is in relay.c, where we unlink the circ from the n_conn immediately after.

The second is in command.c, where we refuse new circuits for various reasons. No cells queued yet.

The third is in circuitlist.c, during circuit-mark-for-close. So the question is, if the circuit is marked, do we really want to still be trying to send its cells? Either it got marked from the other side, in which case the other side has already decided it doesn't want to hear from us. Or it got marked from our side, in which case either that was an error (circuit needs to die) or it has no streams on it, in which case there's no harm in clearing its queue.

Did I miss anything?

Perhaps we should check, in main.c when we're considering whether to mark it because it's idle, whether there are any cells still in its queue.

Or maybe we should just pass an argument into connection_or_send_destroy() from the truncate case in relay.c saying "clear the queue please". Then we can remain ambiguous in the cases above where I said "no harm because surely the queue has nothing in it". That's certainly the safest approach, but wouldn't it be nice to simplify rather than complexify? :)

[Automatically added by flyspray2trac: Operating System: All]

to

r9913 took out the "clear cell queue after sending destroy cell" logic.

Bug 1150 (comment 3519) points out that the cells are just going to get dropped by the other side anyway.

I think it's right.

There are three places that call connection_or_send_destroy().

The first is in relay.c, where we unlink the circ from the n_conn immediately after.

The second is in command.c, where we refuse new circuits for various reasons. No cells queued yet.

The third is in circuitlist.c, during circuit-mark-for-close. So the question is, if the circuit is marked, do we really want to still be trying to send its cells? Either it got marked from the other side, in which case the other side has already decided it doesn't want to hear from us. Or it got marked from our side, in which case either that was an error (circuit needs to die) or it has no streams on it, in which case there's no harm in clearing its queue.

Did I miss anything?

Perhaps we should check, in main.c when we're considering whether to mark it because it's idle, whether there are any cells still in its queue.

Or maybe we should just pass an argument into connection_or_send_destroy() from the truncate case in relay.c saying "clear the queue please". Then we can remain ambiguous in the cases above where I said "no harm because surely the queue has nothing in it". That's certainly the safest approach, but wouldn't it be nice to simplify rather than complexify? :)

[Automatically added by flyspray2trac: Operating System: All]

That branch doesn't seem to contain the right code, instead it is the fix for #1195 (moved)

Oops. Should be fixed now.

wanoskarnet pointed out on IRC that the behavior on TRUNCATE cells makes the network fragile if anybody uses TRUNCATE. Unfortunately, this isn't new behavior with this patch: the network has operated this way since cell queues were introduced. To make TRUNCATE work right will be another (orthogonal) bugfix. Since nobody's using TRUNCATE after streams now AFAICT, the fix can probably wait till 0.2.3.x.

I've edited the spec in branch bug1184 more to clarify that this isn't working-as-intended.

See bug #1784 (moved) for TRUNCATE issue.

The two patches leave me confused. The spec edit should probably be squashed into the earlier commit and reworded, and now I wonder why we want to patch this at all until 1784 is resolved. What are we actually fixing with this patch?

What are we actually fixing with this patch?

See the commit message on the first of the patches:

At best, this patch helps us avoid sending queued relayed cells that
would get ignored during the time between when a destroy cell is
sent and when the circuit is finally freed.  At worst, it lets us
release some memory a little earlier than it would otherwise.

Right, I read that. What I mean is, does this actually buy us anything useful in the current network? I was thinking that probably we would want to fix 1784 which is likely to change our behavior here again?

Well, for all cases except the TRUNCATE case, it (might) improve conformance with the spec, which frowns on sending any cells once we've gotten a destroy. Even after fixing 1784, we won't change anything in the non-TRUNCATE cases.

In the TRUNCATE case, the patch (maybe) trades a severely nonconformant broken behavior for a less visibly nonconformant broken behavior. Currently, on a TRUNCATE, we send a DESTROY, maybe send a couple of queued cells after the destroy (bug 1184, fixed here), and drop queued cells when we close the circuit (bug 1784). This patch fixes the "We might send cells on a destroyed circuit" problem, and leaves the "TRUNCATE destroys the circuit too early" problem for later.

Ok thanks! Makes sense now. I've pushed a patch to bug1184 in my repo that changes the tor-spec wording to make more sense to me. Some native english speaker please review. We probably want to squash the three commits into one before merging.

Merged your bug1184 patch onto my bug1184 branch

+   [Note: If an OR receives a TRUNCATE cell and it has any RELAY cells
+   still queued on the circuit for the next node that it had not yet sent,

s/ that it had not yet sent//

+   it will drop them without sending them.  This is not considered
+   conformant behavior, but it probably won't get fixed till a later

s/till/until/ (unless we all want to switch over to older English? :)

+   version of Tor.  Thus, clients SHOULD NOT send a TRUNCATE cell to a
+   node running any current version of Tor if they have sent relay cells
+   through that node, and they aren't sure whether those cells have been
+   sent on.]

...if a) they have sent relay cells through that node, and b) they aren't sure whether those cells have been sent on yet.

Geez. I just tried to go answer the question "is it the case that when we try to move a cell from circ's cell queue onto the outbuf, we notice that circ->n_conn is null and choose not to send the cell", which moved to "where the heck do we move cells from the circ cell queue onto the outbuf other than connection_or_flush_from_first_active_circuit() which seems to only move one cell and then stop", which moved to "are there any cases where an inactive circuit moves cells to the outbuf" (I think no), which moved to "at what point in closing a circuit do we clear its cell queue" (I guess when we're freeing it and not before).

Then I noticed in connection_edge_process_relay_cell() that we call

circuit_set_n_circid_orconn(circ, 0, NULL);

after the destroy cell (sounds good), but now I'm wondering why we don't call that from the two cases in _circuit_mark_for_close() that send destroy cells. Is it because in those two cases the circuit is going to get marked, which is just as good as setting its n_circid and n_conn to 0,null, for some reason I haven't found yet?

What a mess. :/

"is it the case that when we try to move a cell from circ's cell queue onto the outbuf, we notice that circ->n_conn is null and choose not to send the cell"

See below....

"where the heck do we move cells from the circ cell queue onto the outbuf other than connection_or_flush_from_first_active_circuit() which seems to only move one cell and then stop"

That's the only place grepping for uses of cell_queue_pop, which is the only place that pulls a cell of a cell queue. The function doesn't "only move one cell and then stop", though. The "for (n_flushed = 0; n_flushed < max && queue->head; ) {" loop in the middle sends more than one cell. It stops when "max" cells have been flushed to the or_conn's outbuf, or when there are no more active circuits (that is, ones with cells to flush).

So, to answer the first question above: the process of moving cells to an orconn outbuf is entirely driven by the orconn looking through its list of active circuits, and not by doing anything to the circuit, it's not possible that a null n_conn on the circ will get followed.

"are there any cases where an inactive circuit moves cells to the outbuf" (I think no)

I agree.

"at what point in closing a circuit do we clear its cell queue" (I guess when we're freeing it and not before).

Right. cell_queue_clear() is called in circuit_free and nowhere else. [This branch changes it so that it is now also called by circuit_clear_cell_queue.]

but now I'm wondering why we don't call that from the two cases in _circuit_mark_for_close() that send destroy cells.

I guess we could unlink the circuit from p_conn/n_conn as appropriate, but I am not sure we need to. What would the rationale be? When we call circuit_clear_cell_queue(), the circuit becomes inactive. It can't get any more cells added to it, because it's marked, so it can't become active, so it can't flush cells.

/me checks whether "It can't get any more cells added to it, because it's marked" is true.

Processing.

Hm. This is some complicated stuff. It might be a good idea to add a check to append_cell_to_circuit_queue() so that we drop cells that anybody tries to queue on a marked circuit.

Also, I can't see a reason not to unlink the circuit just after sending the destroy. What Could Go Wrong?

Replying to nickm:

"where the heck do we move cells from the circ cell queue onto the outbuf other than connection_or_flush_from_first_active_circuit() which seems to only move one cell and then stop"

That's the only place grepping for uses of cell_queue_pop, which is the only place that pulls a cell of a cell queue. The function doesn't "only move one cell and then stop", though. The "for (n_flushed = 0; n_flushed < max && queue->head; ) {" loop in the middle sends more than one cell. It stops when "max" cells have been flushed to the or_conn's outbuf, or when there are no more active circuits (that is, ones with cells to flush).

But max is always 1.

But max is always 1.

Ah, right you are.

Replying to nickm:

But max is always 1.

Ah, right you are.

Ah ha. We call connection_or_flush_from_first_active_circuit() multiple times from connection_or_flushed_some() though. That's how we fill it up. Great.

More generally, are we way off the mark now for "important fix that has to get into 0.2.2.x"?

It seems these are all of the form "I bet it wouldn't break anything if we added a check to confirm foo, but as far as I can tell from this code, foo is already the case".

I don't know what you're referring to by "now" and "these". Do you mean bug 1184 as a whole, or the stuff starting with your "Geez" comment ?

If the latter, the one thing I can't prove from the code is actually the case is that cells should never get queued on a marked circuit. If you agree, then we should add a check/fix for that.

Added a check for closed circuit in append_cell_to_circuit_queue(), and made requested changes in tor-spec.txt. Okay to merge?

Sure, go for it.

I feel like saying "trap" here -- we keep adding defensive programming lines because we're not certain what is actually going on, and in the future we'll never remember which ones we thought were redundant so we'll never be able to take any of them out.

I guess that means we should schedule this area as one of the prime needs for refactoring, in that glorious future when we have time and energy for refactoring.

Let me explain in detail.

I rather suspect that this check is necessary: append_cell_to_circuit_queue() is called by:

circuit_receive_relay_cell, which checks for marked_for_close.
circuit_package_relay_cell, which does not check, and is called by:
- relay_send_command_from_edge, which does not check, and is called by:
  - '''Roughly 20 different places throughout the code'''
circuit_deliver_create_cell, which does not check, and is called by:
- circuit_n_conn_done, which does check.
- circuit_extend, which does not check, and is called by:
  - connection_edge_process_relay_cell, which is called by:
    - circuit_receive_relay_cell, which checks.
onionskin_answer, which does not check, and is called by:
- command_process_create_cell(), which does not need to check since it creates the circuit in question.
- connection_cpu_process_inbuf(), which checks implicitly by retrieving the circuit using circuit_get_by_circid_orconn().

So modulo the 20 callers of relay_send_command_from_edge, then yeah, I am reasonably certain that we don't need to add the check. But becoming reasonably certain about this is hard and time-consuming, and that itself suggests a problem in this part of our code: if I want to make sure that some X never happens if Y is true, I shouldn't have to trace all parts of the call graph that lead to X to make sure that Y is checked along each one. Putting a check at the position right before X is done is a better way to ensure that X never happens if Y than instituting a code-wide rule of "When you call the thing that does X, make sure you have checked Y, or every path that leads to you has checked Y."

But with the change in 87f18c95783, I'm never going to have to waste my time doing this analysis again. And isn't that nice?

(Merged, so closing. Feel free to comment more if you think the check in 87f18c95783 should work different.)

Trac:
Status: needs_review to closed
Resolution: None to fixed

Trac:
Component: Tor Client to Tor

closed

mentioned in issue #1784 (moved)

moved to tpo/core/tor#1184 (closed)

mentioned in issue tpo/core/tor#1784 (closed)

Sending useless relay cells after the destroy cell

Child items ...

Activity