Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#20459 closed defect (fixed)

ewma_cmp_cmux never considers policies different

Reported by: pastly Owned by:
Priority: High Milestone: Tor: 0.3.0.x-final
Component: Core Tor/Tor Version: Tor:
Severity: Normal Keywords: 029-backport, review-group-11
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:


Looks like a copy/paste bug from 700d6e75. See line 502.

The earliest branches I see this commit in are {maint,release}-0.2.6.

Branch incoming once I get a ticket number.

Child Tickets

Attachments (2)

qtime.shadow.results.pdf (33.1 KB) - added by pastly 4 years ago.
Kernel queue time
shadow.results.pdf (1.8 MB) - added by pastly 4 years ago.
Other shadow results


Change History (16)

comment:1 Changed 4 years ago by teor

Wow, that's awkward.
Thanks for finding this.
We should add some unit tests to make sure that we don't regress.

comment:3 Changed 4 years ago by dgoulet

Milestone: Tor: 0.3.0.x-final
Priority: Medium → High
Version: Tor: unspecified → Tor:

comment:4 Changed 4 years ago by dgoulet

Status: new → merge_ready

lgtm; there could be an argument for "Major bugfixes" in the changes file, but I'll leave that to nickm :). Thanks for this!

comment:5 Changed 4 years ago by dgoulet

Oh and consider backporting probably?

comment:6 in reply to:  5 Changed 4 years ago by teor

Replying to dgoulet:

Oh and consider backporting probably?

Not for a performance issue, surely?

I could see an argument for a backport to 0.2.9, but that would risk destabilising the whole 0.2.9 series.

Since we've never actually run much of this code, it could do anything. It could be broken in other ways.

comment:7 Changed 4 years ago by pastly

Status: merge_ready → needs_review

FWIW, I've watched the order in which the scheduler picks channels. A few months ago (going from memory), the scheduler picked channels in an order something like this:

3, 2, 5, 3, 2, 5, 3, 2, 3, 2, 3, 2, 3, 3, 3, 3

With this fix, it picks channels in an order something like this:

11, 7, 7, 1, 1, 1, 1, 1

I encourage us to verify the "before" behavior, since memory is much less reliable than logs (memory <<< logs).

Of course, this tells us nothing about how circuits are picked within a channel, which is very important.

I'm moving this out of merge_ready because I think we have too many questions right now. Hopefully needs_review is a more accurate description.

comment:8 Changed 4 years ago by nickm

Keywords: 029-backport added
Milestone: Tor: 0.3.0.x-final → Tor: 0.2.9.x-final

Gosh. I can also see getting this in place for 0.2.9. But it does at least seem like an obvious merge for 0.3.0. I'll merge for master. If we choose to backport to 0.2.9, the thing to cherry-pick is c09993fdf6beb80d8c5f34250092c931333f7ac0

comment:9 Changed 4 years ago by nickm

Keywords: review-group-11 added

comment:10 Changed 4 years ago by dgoulet

FWIW, I've been running this in the test network and on my real network relay for >24h. I'm monitoring both of them, so I'll report if anything blows up, but so far so good. That said, I'm having a hard time seeing how we can measure the effect of this fix except by running Shadow at large scale, before and after.

comment:11 Changed 4 years ago by pastly

Also FWIW, I'm currently running this change in small Shadow networks as I'm testing other things. I'm measuring kernel queue time, tor queue time, network throughput, and time to download. By these four metrics, this change doesn't hurt a relay in a small network. It also doesn't really seem to help.

I started a large Shadow simulation a couple days ago, but it died sometime last night :(. I've started it again and will report back when it has finished. I suspect the small network was too small for the difference to be noticeable. Namely:

  • not enough channels from one relay to ~all others
  • not enough circuits per channel

The large network has 500 relays and 7500 clients, compared to the small network's 11 and 12, respectively.

comment:12 Changed 4 years ago by nickm

Calling this "no backport for 0.2.9 right now" per discussion at meeting today. It's a small change, but it is fundamental and needs more testing.

comment:13 Changed 4 years ago by nickm

Milestone: Tor: 0.2.9.x-final → Tor: 0.3.0.x-final
Resolution: fixed
Status: needs_review → closed

comment:14 Changed 4 years ago by pastly

Ready for some graphs? This data was collected from two Shadow simulations, each with 500 relays and 7500 clients. The one-line summary of the results (to me) is: the ewma fix does no harm by the metrics I gathered.

qtime.shadow.results.pdf shows the amount of time a cell spends in the kernel outbound buffer after leaving Tor and becoming bytes. As you can see, no change.

shadow.results.pdf shows a lot.

  • Time to download x bytes, for each different type of client. Clients with a smaller x behave a lot like web browsers; clients with a larger x are near-continuous bulk downloaders. Looks like no change.
  • A bunch of probably self-explanatory graphs, all showing no change.
  • 60-second moving average read (pg 22) and write (pg 25). We give the network 30 simulation minutes for every relay/client to boot up and reach steady state, then measure for 10 simulation minutes. 'After' seems more stable than 'before', which would explain the better read and write performance for 'after'.

So, IMO, the fix causes no harm and is beneficial. I can answer any questions about what graphs mean, what the simulations consist of, etc.

I will attach the files right after hitting submit on this. Because I can't do it all at once ... ?

Changed 4 years ago by pastly

Attachment: qtime.shadow.results.pdf added

Kernel queue time

Changed 4 years ago by pastly

Attachment: shadow.results.pdf added

Other shadow results
