Opened 7 months ago

Closed 7 months ago

Last modified 7 months ago

#20459 closed defect (fixed)

ewma_cmp_cmux never considers policies different

Reported by: pastly
Owned by:
Priority: High
Milestone: Tor: 0.3.0.x-final
Component: Core Tor/Tor
Version: Tor: 0.2.6.2-alpha
Severity: Normal
Keywords: 029-backport, review-group-11
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

Looks like a copy/paste bug from 700d6e75. See line 502.

The earliest branches I see this commit in are {maint,release}-0.2.6.

Branch incoming once I get a ticket number.
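
For anyone skimming: the shape of the bug, as I read it, is roughly the sketch below. This is not the actual circuitmux_ewma.c code (the type and field names are simplified stand-ins); the point is the copy/paste pattern where the second argument never gets used for the conversion:

{{{
/* Illustrative sketch only -- simplified stand-in names, not Tor's code. */
typedef struct {
  double front_cell_ewma;  /* EWMA activity value for this mux's best circuit */
} ewma_policy_data_sketch_t;

int
ewma_cmp_cmux_sketch(const void *pol_data_1, const void *pol_data_2)
{
  const ewma_policy_data_sketch_t *p1 = pol_data_1;
  const ewma_policy_data_sketch_t *p2 = pol_data_1; /* BUG: copy/paste --
                                                     * should be pol_data_2 */
  if (p1->front_cell_ewma < p2->front_cell_ewma)
    return -1;
  else if (p1->front_cell_ewma > p2->front_cell_ewma)
    return 1;
  else
    return 0;
}
}}}

Since both pointers end up referring to the same data, the comparison can never report a difference, which is exactly what the ticket title says.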

Child Tickets

Attachments (2)

qtime.shadow.results.pdf (33.1 KB) - added by pastly 7 months ago.
Kernel queue time
shadow.results.pdf (1.8 MB) - added by pastly 7 months ago.
Other shadow results


Change History (16)

comment:1 Changed 7 months ago by teor

Wow, that's awkward.
Thanks for finding this.
We should add some unit tests to make sure that we don't regress.
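
As a starting point, here is a stand-alone sketch of the property a regression test should pin down: the comparator has to actually look at both arguments. The names below are made up for illustration; a real test would presumably exercise the comparator through the public circuitmux API rather than poking at policy data directly.

{{{
/* Stand-alone regression sketch -- illustrative names, not test-suite code. */
#include <assert.h>
#include <stdio.h>

typedef struct {
  double front_cell_ewma;
} ewma_policy_data_sketch_t;

/* Fixed comparator: uses both arguments. */
static int
ewma_cmp_cmux_fixed(const void *pol_data_1, const void *pol_data_2)
{
  const ewma_policy_data_sketch_t *p1 = pol_data_1;
  const ewma_policy_data_sketch_t *p2 = pol_data_2;
  if (p1->front_cell_ewma < p2->front_cell_ewma)
    return -1;
  else if (p1->front_cell_ewma > p2->front_cell_ewma)
    return 1;
  else
    return 0;
}

int
main(void)
{
  ewma_policy_data_sketch_t busy  = { 10.0 };
  ewma_policy_data_sketch_t quiet = { 1.0 };

  /* With the copy/paste bug, the comparator would return 0 for all three
   * calls, and the first two assertions would fail. */
  assert(ewma_cmp_cmux_fixed(&quiet, &busy) < 0);
  assert(ewma_cmp_cmux_fixed(&busy, &quiet) > 0);
  assert(ewma_cmp_cmux_fixed(&busy, &busy) == 0);

  puts("comparator regression sketch: OK");
  return 0;
}
}}}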

comment:3 Changed 7 months ago by dgoulet

  • Milestone set to Tor: 0.3.0.x-final
  • Priority changed from Medium to High
  • Version changed from Tor: unspecified to Tor: 0.2.6.2-alpha

comment:4 Changed 7 months ago by dgoulet

  • Status changed from new to merge_ready

lgtm; there could be an argument for "Major bugfixes" in the change file, but I'll leave that to nickm :). Thanks for this!

comment:5 follow-up: Changed 7 months ago by dgoulet

Oh and consider backporting probably?

comment:6 in reply to: ↑ 5 Changed 7 months ago by teor

Replying to dgoulet:

Oh and consider backporting probably?

Not for a performance issue, surely?

I could see an argument for a backport to 0.2.9, but that would risk destabilising the whole 0.2.9 series.

Since we've never actually run much of this code, it could do anything. It could be broken in other ways.

comment:7 Changed 7 months ago by pastly

  • Status changed from merge_ready to needs_review

FWIW, I've watched the order in which the scheduler picks channels. A few months ago (going from memory), the scheduler would pick channels in an order something like this:

3, 2, 5, 3, 2, 5, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3

With this fix, it picks channels in an order something like this:

11, 7, 7, 1, 1, 1, 1, 1

I encourage us to verify the before behavior as memory <<< logs.

Of course, this tells us nothing about how circuits are picked within a channel, which is very important.
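
To make it concrete why this matters for pick order, here is a toy model. It is not Tor's scheduler and doesn't try to reproduce the sequences above; it only shows that a comparator which never reports a difference leaves the pick loop blind to per-channel load (all names and constants here are invented for the illustration):

{{{
/* Toy model: two comparators driving a pick-the-least-busy loop. */
#include <stdio.h>

#define N_CHANNELS 3
#define N_PICKS 12
#define DECAY 0.9      /* per-tick decay applied to every channel's EWMA */
#define SEND_COST 1.0  /* EWMA bump for the channel that gets picked */

typedef int (*cmp_fn)(const double *a, const double *b);

static int
cmp_broken(const double *a, const double *b)
{
  (void)a; (void)b;
  return 0;  /* "never considers them different" */
}

static int
cmp_fixed(const double *a, const double *b)
{
  return (*a > *b) - (*a < *b);
}

static void
run(const char *label, cmp_fn cmp)
{
  double ewma[N_CHANNELS] = { 0.0, 0.0, 0.0 };
  printf("%s:", label);
  for (int t = 0; t < N_PICKS; t++) {
    int best = 0;
    for (int i = 1; i < N_CHANNELS; i++) {
      if (cmp(&ewma[i], &ewma[best]) < 0)
        best = i;  /* never fires with cmp_broken, so channel 0 always wins */
    }
    printf(" %d", best);
    for (int i = 0; i < N_CHANNELS; i++)
      ewma[i] *= DECAY;       /* everyone cools off a little */
    ewma[best] += SEND_COST;  /* the picked channel gets busier */
  }
  printf("\n");
}

int
main(void)
{
  run("broken", cmp_broken);
  run("fixed ", cmp_fixed);
  return 0;
}
}}}

The "broken" run picks channel 0 every single time; the "fixed" run keeps moving to whichever channel has been least busy recently.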

I'm moving this out of merge_ready because I think we have too many questions right now. Hopefully needs_review is a more accurate description.

comment:8 Changed 7 months ago by nickm

  • Keywords 029-backport added
  • Milestone changed from Tor: 0.3.0.x-final to Tor: 0.2.9.x-final

Gosh. I can also see getting this in place for 0.2.9. But it does at least seem like an obvious merge for 0.3.0. I'll merge for master. If we choose to backport to 0.2.9, the thing to cherry-pick is c09993fdf6beb80d8c5f34250092c931333f7ac0

comment:9 Changed 7 months ago by nickm

  • Keywords review-group-11 added

comment:10 Changed 7 months ago by dgoulet

FWIW, I've been running this in the test network and on my real network relay for >24h. I'm monitoring both of them, so I'll report if anything blows up, but so far so good. That said, I'm having a hard time seeing how we can measure the effect of this fix except by using Shadow at large scale with a before and after.

comment:11 Changed 7 months ago by pastly

Also FWIW, I'm currently running this change in small Shadow networks while testing other things. I'm measuring kernel queue time, Tor queue time, network throughput, and time to download. By these four metrics, this change doesn't hurt a relay in a small network. It also doesn't really seem to help.

I started a large Shadow simulation a couple of days ago, but it died sometime last night :(. I've started it again and will report back when it finishes. I suspect the small network was too small for the difference to be noticeable. Namely:

  • not enough channels from one relay to ~all others
  • not enough circuits per channel

The large network has 500 relays and 7500 clients, compared to the small network's 11 and 12, respectively.

comment:12 Changed 7 months ago by nickm

Calling this "no backport for 0.2.9 right now" per discussion at today's meeting. It's a small change, but it is fundamental and needs more testing.

comment:13 Changed 7 months ago by nickm

  • Milestone changed from Tor: 0.2.9.x-final to Tor: 0.3.0.x-final
  • Resolution set to fixed
  • Status changed from needs_review to closed

comment:14 Changed 7 months ago by pastly

Ready for some graphs? This data was collected from two Shadow simulations, each with 500 relays and 7500 clients. The one-line summary of the results (to me) is: the ewma fix does no harm by the metrics I gathered.

qtime.shadow.results.pdf shows the amount of time a cell spends in the kernel outbound buffer after leaving Tor and becoming bytes. As you can see, no change.

shadow.results.pdf shows a lot.

  • Time to download x bytes, for each different type of client. Clients with a smaller x behave a lot like web browsers. Clients with a larger x are near-continuous bulk downloaders. Looks like no change.
  • A bunch of probably self-explanatory graphs all showing no change.
  • 60-second moving average read (pg 22) and write (pg 25) throughput. We give the network 30 simulated minutes for every relay/client to boot up and reach steady state, then measure for 10 simulated minutes. 'After' seems more stable than 'before', which would explain the better read and write performance for 'after'.

So, IMO, the fix causes no harm and is beneficial. I can answer any questions about what graphs mean, what the simulations consist of, etc.

I will attach the files right after hitting submit on this. Because I can't do it all at once ... ?

Changed 7 months ago by pastly

Kernel queue time

Changed 7 months ago by pastly

Other shadow results
