Opened 3 years ago

Last modified 8 months ago

#23712 new defect

sched: DESTROY cell on a circuit bypasses the scheduler

Reported by: dgoulet Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version:
Severity: Normal Keywords: tor-sched, 034-triage-20180328, 034-removed-20180328
Cc: Actual Points:
Parent ID: #23993 Points:
Reviewer: Sponsor:


If you look at circuitmux_append_destroy_cell(), it is the one appending a DESTROY cell to the cmux queue and then calls channel_flush_from_first_active_circuit() if no writes are pending that is if the outbuf is empty (also looks at the out queue but that is always empty #23709).

In the case the flush is triggered, the cell is immediately put in the outbuf and written to kernel by libevent which completely bypasses the scheduler. Maybe it is what we want that is go as fast as we can in destroying a circuit? Don't know but it has this effect on the scheduler where the channel is scheduled with a "wants_to_write" event from the connection subsystem and ultimately the channel gets scheduled with nothing in the queue because it is already on the outbuf. For KIST, this is not ideal because KIST should control the flow of data to the kernel.

It seems there are two places we queue cells into a cmux queue: circuitmux_append_destroy_cell() and append_cell_to_circuit_queue(). The latter triggers a "has waiting cells" for the scheduler which is what we want but the former just bypasses it.

I think it should simply trigger that notify to the scheduler instead of flushing it by itself.

Child Tickets

Change History (21)

comment:1 Changed 3 years ago by dgoulet

Status: newneeds_review

See branch: bug23712_032_01.

Ok this might look like a very simple fix but it is really not.

For KIST, this is what we _need_ to do else this process of appending a DESTROY cell bypasses the scheduler and can trigger KIST to try to flush an empty channel which could be one of the source of #23676. The following lists the possible channel state it can be when the destroy cell is appended and the scheduler notified:

  1. if in WAITING_FOR_CELLS, all good, we will get scheduled
  2. KIST can't let unscheduled channel in WAITING_TO_WRITE so we can't be calling "waiting for cells" on a "waiting to write" channel.
  3. KIST doesn't put channel in IDLE state
  4. PENDING state, can't escape the scheduler, like (ii).

For the Vanilla kernel:

  1. if in WAITING_FOR_CELLS, all good, we will get scheduled
  2. if in WAITING_TO_WRITE, it means we have data in the queue but can't put them in the outbuf because it is full. The flushed_some() function will call the "wants to write" event when the outbuf is flushed to the network and thus the DESTROY cell will ultimately get drained.
  3. PENDING state, can't escape the scheduler.
  4. if in the IDLE state, it means we have no more cell in the queue and the outbuf is full, the channel is put in WAITING_TO_WRITE state so (ii) applies to this case that is the "wants to write" event will trigger a flush of the destroy cell at some point when the outbuf is drained.

comment:2 Changed 3 years ago by dgoulet

One quick note here. When a DESTROY cell is sent, it is from circuit_about_to_free() which means that the "circuit_t" will get freed momentarily. However, we put a cell on the circuit queue for which we should be updating our cmux cell count (circuitmux_set_num_cells()) but that function takes a circuit object that we can't use... The circ ID is available but we can't do much with it with the current interface.

That is "OK" because for the scheduler to know if more needs to be flushed, it uses circuitmux_num_cells() which counts the cell + destroy cell.

comment:3 Changed 3 years ago by nickm

Owner: set to dgoulet
Status: needs_reviewassigned

setting owner

comment:4 Changed 3 years ago by nickm

Status: assignedneeds_review

comment:5 Changed 3 years ago by nickm

Keywords: review-group-24 added

review-group-24 is now open.

comment:6 Changed 3 years ago by nickm

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final

deferring after talking with dgoulet: this probably isn't critical for 0.3.2

comment:7 Changed 3 years ago by dgoulet

Sponsor: SponsorV

Tagging all scheduler (and KIST) related tickets for SponsorV

comment:8 Changed 3 years ago by nickm

Status: needs_reviewmerge_ready

I'm okay with this change, but first: has this been tested enough to make sure that these properties hold?

1) In the case where the channel has no currently queued cells, the DESTROY cell should get sent soon, and not wait for some other activity to trigger the scheduler.
2) DESTROY cells should really get sent.

It looks to me like they should, given the code, but it would be good to be sure, since these are the kind of issues where if we got them wrong, we might not notice for a while.

comment:9 Changed 3 years ago by dgoulet

Many things needs to be confirmed in terms of cell ordering properties and assumptions from channels to the scheduler.

I plan to do this as part of a broader set of changes that needs to happen since KIST has been added. I'm scheduled by November to do this so expect it in 033.

comment:10 Changed 3 years ago by dgoulet

Parent ID: 23993

comment:11 Changed 3 years ago by dgoulet

Parent ID: 23993#23993

comment:12 Changed 3 years ago by dgoulet

Status: merge_readyneeds_revision

Not merge_ready at all... I thought I had change its status!...

comment:13 Changed 3 years ago by dgoulet

Keywords: review-group-24 removed
Milestone: Tor: 0.3.3.x-finalTor: 0.3.4.x-final
Status: needs_revisionaccepted

The proposed branch won't work properly so putting this one back in accepted state. The solution is much more involving and possibly requires a bit of research. Moving this one to 034 as there is little chance to fix this in 033.

comment:14 Changed 3 years ago by nickm

Keywords: 034-triage-20180328 added

comment:15 Changed 3 years ago by nickm

Keywords: 034-removed-20180328 added

Per our triage process, these tickets are pending removal from 0.3.4.

comment:16 Changed 3 years ago by nickm

Milestone: Tor: 0.3.4.x-finalTor: unspecified

These tickets, tagged with 034-removed-*, are no longer in-scope for 0.3.4. We can reconsider any of them, if time permits.

comment:17 Changed 17 months ago by gaba

Removing sponsor V as we do not have more time to include this tickets in the sponsor.

comment:18 Changed 17 months ago by gaba

Sponsor: SponsorV

Removing sponsor from tickets that we do not have time to fit in the remain of this sponsorship.

comment:19 Changed 17 months ago by gaba

Owner: dgoulet deleted
Status: acceptedassigned

dgoulet will assign himself to the ones he is working on right now.

comment:20 Changed 10 months ago by dgoulet

Reviving this. Things have changed since the initial report 2 years ago. But things are still not good:

  1. circuitmux_append_destroy_cell() only triggers a flush with channel_flush_from_first_active_circuit if the channel has no queued writes which translates to "Do we have data in the outbuf?".

This is wrong. Lets assume we do have data in the outbuf, then the flush is not triggered but then it means the channel has to become active to be considered for a flush. And the only way this can happen is if at least one cell is put on it.

Which means that a DESTROY cell can stay in the queue of the channel for an arbitrary amount of time without being sent.

  1. The second problem is that calling channel_flush_from_first_active_circuit() does _not_ guarantee that the DESTROY cell is sent.

So circuitmux_append_destroy_cell() is really not doing the right thing in many ways.


I think the initial branch has it correct bug23712_032_01 that is we simply need to callback the scheduler telling it that the channel has waiting cells and be done.

The comment:13 I believe is refers to the fact that it won't fix the full priority problem for relay cells vs destroy cells but should fix the issue of lingering DESTROY cells.

comment:21 Changed 8 months ago by teor

Status: assignednew

Change tickets that are assigned to nobody to "new".

Note: See TracTickets for help on using tickets.