Opened 3 months ago

Last modified 3 months ago

#29698 new defect

Edge case that causes improper circuit prioritization for one scheduling run

Reported by: pastly Owned by:
Priority: Medium Milestone: Tor: 0.4.1.x-final
Component: Core Tor/Tor Version: Tor:
Severity: Normal Keywords: tor-sched tor-cmux kist
Cc: robgjansen Actual Points:
Parent ID: Points:
Reviewer: Sponsor:


The Problem, Shortly

A circuit that goes from very busy for a long time, to 100% idle for a long time, and then needs to send traffic again will be incorrectly deprioritized the first time it gets scheduled.

The Problem, Illustrated

Consider a circuit that is very very busy for a significant length of time (minutes). There's constant traffic flowing in one (or both, but let's just say one) direction on this circuit, leading to it earning for itself a high cell_count EWMA value (thus a low priority for scheduling). Assume it is the only circuit on its channel.

Now assume it suddenly stops sending traffic but stays open. It stays this way for significant length of time (many 10s of seconds) such that its cell_count EWMA value should be essentially zero, but it hasn't actually been updated yet since this value isn't updated until a cell has been transmitted (see circuitmux_notify_xmit_cells).

At this point in time the relay is still servicing some number of low-traffic circuits on other channels. Maybe it has always been handling these circuits. Whatever. It doesn't matter. What matters is at this point in time there's lots of low-traffic circuits needing scheduling. Because they are low-traffic, these circuits have cell_count EWMA values that are relatively low (thus a high priority for scheduling).

Now what happens when that original high-traffic circuit stops being totally idle? What happens when it wants to send another 1000, 100, or even just 1 cell?

It gets put into KIST channels_pending smart list like any other circuit. In fact there are a bunch of low-bandwidth circuits in there with it. Observe what happens when KIST starts scheduling its collection of pending channels:

KIST loops over and over until its list of pending channels is empty. Each time it gets the channel with the current best-priority circuit, schedules one cell, updates the appropriate cell_count, and puts the channel back in the pending list if necessary.

All those low-traffic circuits will be serviced first because they have low cell_count values (high priority) as compared to the outdated cell_count value for the original high-traffic circuit.

When the circuit finally gets to send its first cell after its long period of inactivity, its cell_count EWMA value is corrected to be near zero. That's fine. But it should have been updated before scheduling decisions were made so that it was the first one to be scheduled.

A solution

Add a touch function in the circuitmux channel interface that tells the circuitmux and whatever its policy is to update its circuit priorities if desired.

Before entering the main scheduling loop, call this touch function on all the pending channels. In the case of the EWMA policy, the touch function would ultimately drill down to something like

ewma_touch(circuitmux_policy_data_t *pol_data)
  ewma_policy_data_t *pol = NULL;
  unsigned int tick;
  double fractional_tick;
  pol = TO_EWMA_POL_DATA(pol_data);

  /* Rescale the EWMAs if needed */
  tick = cell_ewma_get_current_tick_and_fraction(&fractional_tick);

  if (tick != pol->active_circuit_pqueue_last_recalibrated) {
    scale_active_circuits(pol, tick);

(Which you might observe is essentially the first part of ewma_notify_xmit_cells(...)).

Child Tickets

Attachments (1)

EWMA vs time with and without touch.png (144.6 KB) - added by pastly 3 months ago.

Download all attachments as: .zip

Change History (3)

comment:1 Changed 3 months ago by pastly

Currently KIST will always handle every pending channel. It will always send at least one cell, thereby updating the EWMA values. Technically a bug, but not super bad IMHO.

But if KIST were to ever stop handling every pending channel -- for example if a global write limit were added on the maximum number of cells it is allowed to send over all channels --, then channels could "get forgotten about."

I'm attaching two graphs in a single image that show a single channel's best EWMA value over time. In both graphs, the middle 120 seconds takes place when a global write limit and extra load are applied to the relay.

The left graph is without the touch function. The channel we are looking at gets "forgotten" during the middle 120 seconds. No cells are sent on it during this time, and its EWMA value never updates. It's stuck. Once the load and write limit are removed, the channel gets to send and its EWMA value is updated correctly.

The right graph adds the touch function. The channel gets its EWMA value updated periodically like all other channels, and it does get its fair chance to send data. The channel is not "forgotten."

Changed 3 months ago by pastly

comment:2 Changed 3 months ago by dgoulet

Keywords: kist added
Milestone: Tor: 0.3.5.x-finalTor: 0.4.1.x-final

Fortunately for us, the fix seems simple enough. Moving it to 041 since we can address this without massive engineering.

Note: See TracTickets for help on using tickets.