Opened 7 months ago

Last modified 3 days ago

#29698 needs_revision defect

Edge case that causes improper circuit prioritization for one scheduling run

Reported by: pastly Owned by: dgoulet
Priority: Medium Milestone: Tor: 0.4.2.x-final
Component: Core Tor/Tor Version: Tor: 0.3.2.1-alpha
Severity: Normal Keywords: tor-sched, tor-cmux, kist, 041-deferred-20190530, 042-should, BugSmashFund
Cc: robgjansen Actual Points: 0.2
Parent ID: Points: 0.2
Reviewer: nickm Sponsor:

Description

The Problem, Shortly

A circuit that is very busy for a long time, then 100% idle for a long time, and then needs to send traffic again will be incorrectly deprioritized the first time it is scheduled.

The Problem, Illustrated

Consider a circuit that is very busy for a significant length of time (minutes). There's constant traffic flowing in one direction (or both, but let's just say one) on this circuit, so it earns a high cell_count EWMA value (and thus a low priority for scheduling). Assume it is the only circuit on its channel.

Now assume it suddenly stops sending traffic but stays open. It stays this way for a significant length of time (many tens of seconds), such that its cell_count EWMA value should be essentially zero. But the value hasn't actually been updated yet, because it is only updated when a cell is transmitted (see circuitmux_notify_xmit_cells).

At this point in time the relay is still servicing some number of low-traffic circuits on other channels. Maybe it has always been handling these circuits; it doesn't matter. What matters is that there are lots of low-traffic circuits needing scheduling, and because they are low-traffic, their cell_count EWMA values are relatively low (thus a high priority for scheduling).

Now what happens when that original high-traffic circuit stops being totally idle? What happens when it wants to send another 1000, 100, or even just 1 cell?

It gets put into KIST's channels_pending smartlist like any other circuit. In fact there are a bunch of low-bandwidth circuits in there with it. Observe what happens when KIST starts scheduling its collection of pending channels:

KIST loops over and over until its list of pending channels is empty. Each iteration it takes the channel with the current best-priority circuit, schedules one cell, updates the appropriate cell_count, and puts the channel back in the pending list if necessary.
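The loop just described can be sketched as a toy model (this is not Tor's actual KIST code; channel and cmux bookkeeping are reduced to a plain array of hypothetical toy_chan structs):

```c
#include <stddef.h>

/* Toy model of the scheduling loop: each round, pick the pending
 * channel whose circuit has the lowest cell_count (best priority),
 * "send" one cell, update the count, and implicitly drop the channel
 * from the pending set once its queue is empty. */
struct toy_chan {
  double cell_count;   /* EWMA-like priority value; lower is better */
  int cells_pending;   /* cells still waiting to be sent */
  int cells_sent;
  int first_sent_at;   /* round in which this channel's first cell went out */
};

static void
schedule_pending(struct toy_chan *chans, size_t n)
{
  int round = 0;
  for (;;) {
    struct toy_chan *best = NULL;
    size_t i;
    /* Find the best-priority channel that still has cells queued. */
    for (i = 0; i < n; i++) {
      if (chans[i].cells_pending > 0 &&
          (best == NULL || chans[i].cell_count < best->cell_count))
        best = &chans[i];
    }
    if (best == NULL)
      break;                   /* pending list is empty: done */
    best->cells_pending--;     /* schedule one cell */
    if (best->cells_sent++ == 0)
      best->first_sent_at = round;
    best->cell_count += 1.0;   /* stand-in for the EWMA update on xmit */
    round++;
  }
}
```

Feed this a channel with a stale cell_count of 1000 holding one cell, alongside two low-traffic channels at 2.0 holding five cells each, and the stale channel's single cell is scheduled dead last, even though its real (decayed) priority should have been the best.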

All those low-traffic circuits will be serviced first because they have low cell_count values (high priority) as compared to the outdated cell_count value for the original high-traffic circuit.

When the circuit finally gets to send its first cell after its long period of inactivity, its cell_count EWMA value is corrected to be near zero. That's fine. But it should have been updated before scheduling decisions were made, so that it would have been the first one scheduled.

A solution

Add a touch function to the circuitmux channel interface that tells the circuitmux (and whatever its policy is) to update its circuit priorities if desired.

Before entering the main scheduling loop, call this touch function on all the pending channels. In the case of the EWMA policy, the touch function would ultimately drill down to something like:

static void
ewma_touch(circuitmux_policy_data_t *pol_data)
{
  ewma_policy_data_t *pol = NULL;
  unsigned int tick;
  double fractional_tick;

  tor_assert(pol_data);
  pol = TO_EWMA_POL_DATA(pol_data);

  /* Rescale the active circuits' EWMAs if we have crossed into a new
   * tick since the last recalibration. (fractional_tick is unused
   * here; the getter just requires the out-parameter.) */
  tick = cell_ewma_get_current_tick_and_fraction(&fractional_tick);
  if (tick != pol->active_circuit_pqueue_last_recalibrated) {
    scale_active_circuits(pol, tick);
  }
}

(Which, you might observe, is essentially the first part of ewma_notify_xmit_cells(...).)
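The wiring from the scheduler down to ewma_touch might look roughly like the following sketch. All names here are hypothetical, modeled loosely on the existing circuitmux policy vtable; the actual patch may differ:

```c
/* Hypothetical sketch of the interface addition: the policy vtable
 * gains an optional notify_touch callback, and the scheduler calls
 * circuitmux_touch() on each pending channel's cmux before entering
 * its main scheduling loop. */

struct circuitmux_policy_data_t;  /* opaque to callers, as in Tor */

typedef struct circuitmux_policy_t {
  /* ... existing callbacks (notify_xmit_cells, pick_active_circuit,
   * etc.) would live here ... */
  void (*notify_touch)(struct circuitmux_policy_data_t *pol_data);
} circuitmux_policy_t;

typedef struct circuitmux_t {
  const circuitmux_policy_t *policy;
  struct circuitmux_policy_data_t *policy_data;
} circuitmux_t;

/* Ask the cmux's policy to refresh its circuit priorities, if it has
 * an opinion on the matter; policies without a notify_touch callback
 * are simply skipped. */
static void
circuitmux_touch(circuitmux_t *cmux)
{
  if (cmux && cmux->policy && cmux->policy->notify_touch)
    cmux->policy->notify_touch(cmux->policy_data);
}

/* Tiny stand-in policy that just counts how often it is touched,
 * purely to exercise the dispatch above. */
static int n_touches = 0;
static void
count_touch(struct circuitmux_policy_data_t *pol_data)
{
  (void)pol_data;
  n_touches++;
}
static const circuitmux_policy_t counting_policy = {
  .notify_touch = count_touch,
};
```

For the EWMA policy, notify_touch would simply point at ewma_touch; policies that don't care about recalibration leave the slot NULL.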

Child Tickets

Attachments (1)

EWMA vs time with and without touch.png (144.6 KB) - added by pastly 7 months ago.


Change History (11)

comment:1 Changed 7 months ago by pastly

Currently KIST will always handle every pending channel and will always send at least one cell, thereby updating the EWMA values. Technically a bug, but not super bad IMHO.

But if KIST were ever to stop handling every pending channel -- for example, if a global write limit were added on the maximum number of cells it is allowed to send across all channels -- then channels could "get forgotten about."

I'm attaching two graphs in a single image that show a single channel's best EWMA value over time. In both graphs, the middle 120 seconds is the period during which a global write limit and extra load are applied to the relay.

The left graph is without the touch function. The channel we are looking at gets "forgotten" during the middle 120 seconds. No cells are sent on it during this time, and its EWMA value never updates. It's stuck. Once the load and write limit are removed, the channel gets to send and its EWMA value is updated correctly.

The right graph adds the touch function. The channel gets its EWMA value updated periodically like all other channels, and it does get its fair chance to send data. The channel is not "forgotten."

Changed 7 months ago by pastly

comment:2 Changed 7 months ago by dgoulet

Keywords: kist added
Milestone: Tor: 0.3.5.x-final → Tor: 0.4.1.x-final

Fortunately for us, the fix seems simple enough. Moving it to 041 since we can address this without massive engineering.

comment:3 Changed 5 months ago by nickm

Keywords: 041-deferred-20190530 added

Marking these tickets as deferred from 041.

comment:4 Changed 5 months ago by nickm

Milestone: Tor: 0.4.1.x-final → Tor: 0.4.2.x-final

comment:5 Changed 6 weeks ago by nickm

Keywords: 042-should added

comment:6 Changed 3 weeks ago by ahf

Owner: set to dgoulet
Status: new → assigned

Distributing 0.4.2 tickets between network team members.

comment:7 Changed 10 days ago by dgoulet

Actual Points: 0.2
Keywords: BugSmashFund added
Points: 0.2
Reviewer: nickm
Status: assigned → needs_review

PR: https://github.com/torproject/tor/pull/1402
Branch: ticket29698_042_01

comment:8 Changed 10 days ago by nickm

Status: needs_review → needs_revision

Looks reasonable; I've left some comments on the patch.

One more thing I'd request: could we please have a regression test for this? That is, a test that fails without this patch and passes with it, to demonstrate that we really fixed the problem.

Also, is this a backport candidate?

Thanks!

comment:9 in reply to:  8 Changed 10 days ago by dgoulet

Replying to nickm:

Looks reasonable; I've left some comments on the patch.

One more thing I'd request: Could we please have a regression test on this? That is, a test that fails without this patch, and passes with it, to demonstrate that we really fixed the problem.

I've spent a bit of time trying to come up with a way to test this, and I've sort of failed to find one in a reasonable time frame. It would hit so many code paths in the cmux/ewma/channel layers that setting up such a test would require quite a bit of work.

We have virtually _no_ EWMA or cmux unit tests, so everything would need to be done with mocking and whatnot to do something proper. Also, the cmux and EWMA code is opaque, even to the tests, so there is more work there to expose what we need to test.

All in all, I can do something there, but there is potential for a much larger patch to support everything needed for testing, and that would turn the ticket into a much larger point size.

Still OK with this?

The other approach, I guess, would be to allocate myself some cycles and add a considerable amount of testing over the entire cmux/EWMA code. I would expect that to be around 2 or 3 days of work for something proper, which would include this touch() interface.

Also, is this a backport candidate?

Nope. It is not critical enough to justify a backport, imo.

comment:10 Changed 3 days ago by dgoulet

The outcome of a discussion on IRC with nickm is that I'll spend some cycles (a couple of days max) adding test coverage to the cmux subsystem, so we at least have a baseline to work with for future changes such as this one.

If that turns out to be a bit too aggressive code-wise, we'll then aim for 0.4.3 for this patch.
