Opened 9 years ago

Closed 2 years ago

#1826 closed defect (worksforme)

Sending packets after RELAY_END

Reported by: mwenge Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: unspecified
Severity: Normal Keywords: tor-relay protocol needs-diagnosis bandwidth
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

If you run Camilo Viecco's test program at:

svn checkout http://tdor.googlecode.com/svn/trunk/testing/

you'll notice that once the program cuts off the downloads at 30 seconds or so, Tor continues to report hefty incoming bandwidth.

The attached dump from Wireshark shows that long after the client has sent a RELAY_END cell for stream 11629 it continues to receive data for that stream from the exit. A rough calculation suggests about 30K worth of data is received after the RELAY_END.

I think this is the exit flushing its cell queue, but given that the client officially stopped caring after RELAY_END, that's quite a bit of wasted bandwidth.

(To read the attached dump search for 'RELAY_END' and keep searching until you find the RELAY_END for stream 11629. Everything after that is the exit flushing unwanted cells at the client.)

Child Tickets

Attachments (1)

cells-bug1826.tar.bz2 (117.0 KB) - added by mwenge 9 years ago.
Exit log


Change History (23)

comment:1 Changed 9 years ago by mwenge

comment:2 Changed 9 years ago by mwenge

Component: - Select a component → Tor Relay
Milestone: Tor: 0.2.2.x-final
Version: Tor: unspecified

comment:3 Changed 9 years ago by nickm

Interesting!

If these cells are getting sent after the RELAY_END arrives at an exit node, they are probably coming because [case 1] they're already in the cell queue or [case 2] in the connection_or's outbuf. Unless [case 3] data in the connection_edge's inbuf is somehow getting packaged even after the RELAY_END cell arrives.

It's also possible [case 4] that some of the cells are already in transit (queued somewhere between the exit node's SSL and the client's Tor) when the client sends the RELAY_END cell. We can't really do anything about those.

So, breaking down the cases:
case 1: cells already on the cell_queue) We don't currently have any good way to remove cells from a circuit's cell queue. By the time they get there, they're encrypted and ready for delivery. Removing them would throw off the circuit's stream cipher. Furthermore, the cells are already packed at that point, so figuring out which streams they belong to would be nontrivial.

We could mitigate this by lowering how full we're willing to let a cell queue get with cells from streams before we stop reading from those streams. This might help stream fairness; I don't know what it would do to performance.

In Tor 0.2.3.x, we're going to have to change the point at which we encrypt cells on queues in order to implement #1479. When we do this, we can revisit the possibility of removing already-queued cells for a dead stream.

case 2: cells already in the connection_or's outbuf) This is way too late to remove the cells.

We could mitigate this by lowering how much we're willing to fill a connection_or's outbuf.

case 3: packaging cells from an exit connection even after getting a relay_end) That would be a bug. If it's happening, we should fix it.

case 4: the cells are in transit on the net) There's nothing we can do about that with our current protocol.

So IMO the right thing to do is make sure that case 3 isn't happening, and think about what tuning we could do to mitigate 1 and 2, and whether it's a net win or not.
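The "throw off the circuit's stream cipher" point in case 1 can be illustrated with a toy counter-mode cipher (illustrative Python only; this is not Tor's actual relay crypto, and the key/cell values are made up): once cells have been encrypted in sequence, dropping one from the queue desynchronizes the decryption of everything behind it.

```python
import hashlib

def keystream(key: bytes, counter: int, length: int) -> bytes:
    """Toy counter-mode keystream (illustrative only, not Tor's relay crypto)."""
    return hashlib.sha256(key + counter.to_bytes(8, "big")).digest()[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key = b"shared-circuit-key"
cells = [b"cell-%03d" % i for i in range(5)]

# The exit encrypts cells in order; each consumes one keystream position.
queue = [xor(c, keystream(key, i, len(c))) for i, c in enumerate(cells)]

# The client decrypts in order: everything lines up.
decrypted = [xor(ct, keystream(key, i, len(ct))) for i, ct in enumerate(queue)]
assert decrypted == cells

# Now suppose we remove queued cell 2 after encryption.  The client still
# advances its keystream once per received cell, so every cell after the
# dropped one decrypts at the wrong keystream position.
del queue[2]
resync = [xor(ct, keystream(key, i, len(ct))) for i, ct in enumerate(queue)]
assert resync[:2] == cells[:2]   # cells before the drop are fine
assert resync[2] != cells[3]     # everything after the drop is garbage
```

This is why removing already-queued cells only becomes thinkable once the encryption point moves, as discussed for the 0.2.3.x work above.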

comment:4 Changed 9 years ago by nickm

Oops. I should have said #1749 , not #1479 .

comment:5 Changed 9 years ago by mwenge

I think this might be a useful clue:

The total number of RELAY_DATA cells received for the three streams combined, after a RELAY_END for each had been sent, is exactly 1000.

I believe that's the size of the cell_queue for a circuit - the spec calls it a 'window', and I assume that's what it's referring to.

I think this tells us that the exit used up the entire cell_queue and only stopped sending when it received no further SEND_MEs for the individual streams.

However, even though the client had sent RELAY_ENDs for each of the streams, as it continued to receive cells on the circuit it continued to send circuit-level SENDMEs (stream ID zero) for every hundred unwanted RELAY_DATA cells it received, even though it had no active streams on the circuit.

If my understanding of the spec is correct, had it not sent those SENDMEs the exit might have stopped pumping the unwanted RELAY_DATAs sooner. The client could have observed that it had no active streams on the circuit and sent no SENDMEs at all.

I should be able to test this theory here.

comment:6 Changed 9 years ago by arma

Interesting. My hope is that the exit relay packages and sends 1000 cells, then it receives the stream end cells, *then* it receives the extra sendmes you're talking about. If so, it wouldn't send any more cells for the streams that it had already ended.

The general problem you describe is known to happen when you're on a slow link, you click on a web page, and then you click stop soon after. You've already set in motion the receiving of that page, and your Tor will waste bandwidth receiving the cells and discarding them since they're for an unrecognized stream.

We could reduce the impact if we reduce the default circuit package window, which we can control from the consensus: see circuit_initial_package_window() in circuitlist.c. But doing so may hurt overall throughput, just like capping tcp's window size will hurt throughput.
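As rough arithmetic (illustrative only; the 498-byte relay cell payload size is from the Tor spec, and the function name here is made up), the worst-case waste after a "stop" scales linearly with the circuit package window, which is the trade-off being described:

```python
# Bytes of relay payload per cell: 509-byte cell payload minus the
# 11-byte relay header, per tor-spec.
RELAY_PAYLOAD_SIZE = 498

def max_wasted_bytes(package_window: int) -> int:
    """Worst-case bytes a client receives and discards after closing its
    streams: at most one full circuit package window of RELAY_DATA cells."""
    return package_window * RELAY_PAYLOAD_SIZE

assert max_wasted_bytes(1000) == 498_000  # today's default: ~0.5 MB worst case
assert max_wasted_bytes(250) == 124_500   # smaller window, smaller waste
```

The cost side of the trade-off (fewer cells in flight, hence lower throughput on high-latency paths) is exactly the TCP-window analogy arma draws.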

As Nick says, it would be good to confirm that we're not sending any further stream cells *after* we get the stream end cell.

I'd expect the cells to be long gone (case 4) in general. I guess for a really slow exit relay (or a really slow middle hop), they might still be sitting in queues at the exit relay.

comment:7 in reply to:  6 ; Changed 9 years ago by mwenge

Replying to arma:

Interesting. My hope is that the exit relay packages and sends 1000 cells, then it receives the stream end cells, *then* it receives the extra sendmes you're talking about. If so, it wouldn't send any more cells for the streams that it had already ended.

On reflection, that's exactly what it appears to be doing. It receives the RELAY_END cells and sends exactly 1000 more cells along the circuit - it keeps pumping until the window is used up. Withholding SENDMEs at that stage of the game has no effect in the testing I've done, and I think that proves the exit is just flushing what it had already written to the queue.

This problem doesn't seem to happen on the one or two fast circuits I've watched. Which reinforces the likelihood that case 4 is not happening.
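The finding that withholding SENDMEs makes no difference can be captured in a toy event model (hypothetical Python, not Tor's code; the numbers mirror the test above): by the time the RELAY_ENDs arrive, the exit has already packaged its full circuit window, so late SENDMEs refill a package window the exit has nothing left to spend on.

```python
CIRC_WINDOW = 1000       # circuit-level package window
SENDME_INCREMENT = 100   # cells acknowledged per circuit-level SENDME

def trailing_cells(withhold_sendmes: bool) -> int:
    """Toy model: RELAY_DATA cells the client receives after its RELAY_ENDs."""
    package_window = CIRC_WINDOW
    queued = 0
    open_streams = 3
    # Before the RELAY_ENDs reach the exit, it packages as much as the
    # circuit window allows (the streams have plenty of buffered data).
    while package_window > 0 and open_streams > 0:
        package_window -= 1
        queued += 1
    # The RELAY_ENDs arrive: all streams are now closed.
    open_streams = 0
    # Any late circuit-level SENDMEs refill the package window...
    if not withhold_sendmes:
        package_window += SENDME_INCREMENT * (queued // SENDME_INCREMENT)
    # ...but with no open streams there is nothing left to package, so the
    # client just receives whatever was already queued, either way.
    while package_window > 0 and open_streams > 0:
        package_window -= 1
        queued += 1
    return queued

assert trailing_cells(withhold_sendmes=True) == 1000
assert trailing_cells(withhold_sendmes=False) == 1000
```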

comment:8 in reply to:  7 Changed 9 years ago by mwenge

Replying to mwenge:

This problem doesn't seem to happen on the one or two fast circuits I've watched. Which reinforces the likelihood that case 4 is not happening.

Sorry, I meant case 3 above.

comment:9 Changed 9 years ago by mwenge

After a little more cell-counting I can see that the total number of RELAY_DATA cells sent for each stream is roughly equivalent to the number of stream-level SENDMEs + 500 (the size of the stream-level package window).

So that seems sane. However, what still puzzles me is why the total number of RELAY_DATA cells sent on the circuit after each stream has sent a RELAY_END is always 1000 cells exactly - i.e. the size of the circuit-level package window. Maybe that will make sense tomorrow. I'm obviously not clear on how the two windows interact.

comment:10 in reply to:  9 ; Changed 9 years ago by mwenge

Replying to mwenge:

So that seems sane. However, what still puzzles me is why the total number of RELAY_DATA cells sent on the circuit after each stream has sent a RELAY_END is always 1000 cells exactly - i.e. the size of the circuit-level package window. Maybe that will make sense tomorrow. I'm obviously not clear on how the two windows interact.

Duh - the circuit package window is the upper limit for the circuit as a whole, so there will always be at most 1000 cells remaining to flush to the client after a RELAY_END, rather than the combined value of the three stream package windows (1500).
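The interaction can be written down as a small worked example (illustrative Python; the window sizes are the ones from the test above): each RELAY_DATA cell consumes both its stream's package window and the shared circuit package window, so the in-flight total is bounded by whichever limit is tighter.

```python
CIRC_WINDOW = 1000    # circuit-level package window
STREAM_WINDOW = 500   # stream-level package window

def max_unacked_cells(n_streams: int) -> int:
    """Upper bound on RELAY_DATA cells in flight (and hence flushable to
    the client after its RELAY_ENDs): the tighter of the circuit window
    and the sum of the per-stream windows."""
    return min(CIRC_WINDOW, n_streams * STREAM_WINDOW)

assert max_unacked_cells(1) == 500    # one stream: the stream window binds
assert max_unacked_cells(3) == 1000   # three streams: the circuit window binds
```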

To close this out I will follow Nick's suggestion and perform this test with my Tor acting as both client and exit and ensure the cell queue isn't being filled after the RELAY_END. I really don't expect this to be happening.

comment:11 in reply to:  10 Changed 9 years ago by mwenge

Replying to mwenge:

To close this out I will follow Nick's suggestion and perform this test with my Tor acting as both client and exit and ensure the cell queue isn't being filled after the RELAY_END. I really don't expect this to be happening.

I tested it with an exit node and client running as separate Tor instances on the same machine, and by the time the exit receives the RELAY_ENDs its cell queue for the circuit is completely full - hence the 1000 trailing cells the client receives after sending its last RELAY_END. There's nothing the exit can do to save bandwidth here except tear down the or_conn once it determines that all the streams on all the circuits multiplexed on that or_conn have received a RELAY_END. I tried this in my testing too, and it works well enough - but there is no pretty way of destroying an or_conn without flushing all the bytes on it first, so I'm not sure whether such an approach is acceptable, even if it's worthwhile.

I'm going to work with Karsten to implement a metric for counting the number of bytes/cells exits flush to circuits with no active streams.

I've attached the exit's log from the test.

Changed 9 years ago by mwenge

Attachment: cells-bug1826.tar.bz2 added

Exit log

comment:12 Changed 9 years ago by mwenge

The best way to read the log is to search for 'removing stream', find the last instance of it, and work back up through the cell packaging messages. I added some extra info to the log messages so you can track the package_window for individual streams.

One thing that struck me was the number of cases where the client sends a stream-level SENDME when the exit's stream level package window is greater than zero but the circ level package_window is zero so no bytes can be packaged.

comment:13 Changed 9 years ago by nickm

So, is there a bug here that should be fixed for 0.2.2.x, or should I take it out of the milestone?

comment:14 in reply to:  13 Changed 9 years ago by mwenge

Replying to nickm:

So, is there a bug here that should be fixed for 0.2.2.x, or should I take it out of the milestone?

There's no bug - you can take it out of the milestone. There's some potential for saving bandwidth so will leave it open to see what adding a metric tells us:

https://gitweb.torproject.org/karsten/tor.git/shortlog/refs/heads/bug1826-metrics

comment:15 Changed 9 years ago by nickm

Milestone: Tor: 0.2.2.x-final

okay; removing from milestone.

comment:16 Changed 9 years ago by nickm

Milestone: Tor: 0.2.3.x-final

comment:17 Changed 8 years ago by nickm

Milestone: Tor: 0.2.3.x-final → Tor: unspecified

comment:18 Changed 7 years ago by nickm

Keywords: tor-relay added

comment:19 Changed 7 years ago by nickm

Component: Tor Relay → Tor

comment:20 Changed 2 years ago by nickm

Keywords: protocol needs-diagnosis bandwidth added
Severity: Normal

comment:21 Changed 2 years ago by arma

I don't know what further diagnosis this one needs. If you send a stream begin, data cells start coming at you. If you then send a stream end, the cells already headed your way keep coming and arrive anyway.

I don't think anything in this ticket has shown that the exit relay queues more data cells for the client after it has received the end cell?

I think this one might be a works-for-me, since Tor is doing what we designed it to do. Maybe somebody should produce a better design, but that's a different ticket. :)

comment:22 Changed 2 years ago by nickm

Resolution: worksforme
Status: new → closed

Closing as worksforarma ;)
