Opened 5 years ago

Closed 2 years ago

#12891 closed task (fixed)

Simulate KIST - global scheduling (#9262) and socket write limits (#12890)

Reported by: robgjansen
Owned by:
Priority: Medium
Milestone: Tor: unspecified
Component: Core Tor/Tor
Version:
Severity: Normal
Keywords: tor-relay kist simulation analysis research-program
Cc:
Actual Points:
Parent ID: #12541
Points: large
Reviewer:
Sponsor:

Description

Simulate #9262 + #12890 in Shadow to test KIST's effect on performance. Come up with good default parameters that reduce congestion and latency.

Child Tickets

Attachments (6)

shadow.perf.results.pdf (488.8 KB) - added by robgjansen 5 years ago.
shadow.goodput.results.pdf (1.6 MB) - added by robgjansen 5 years ago.
shadowtorperf-ideal-model-from-ccs-peerflow.pdf (164.7 KB) - added by robgjansen 4 years ago.
Torperf vs Shadowperf under new Tor network model
kist.stablemodel.shadow.results-sm.pdf (1.5 MB) - added by robgjansen 4 years ago.
varied-load.png (394.1 KB) - added by pastly 2 years ago.
varied-loss.png (388.6 KB) - added by pastly 2 years ago.

Change History (30)

comment:1 Changed 5 years ago by robgjansen

Summary: Simulate KIST - global scheduling (#9262) + socket write limits (#12890) → Simulate KIST - global scheduling (#9262) and socket write limits (#12890)

comment:2 Changed 5 years ago by nickm

Component: - Select a component → Tor
Keywords: tor-relay added

comment:3 Changed 5 years ago by nickm

Milestone: Tor: 0.2.6.x-final

comment:4 Changed 5 years ago by nickm

Milestone: Tor: 0.2.6.x-final → Tor: 0.2.7.x-final

I'm tentatively bumping KIST stuff to 0.2.7.x, since I think it won't be done this month. Please let me know if I'm wrong.

comment:5 Changed 5 years ago by nickm

I'm tentatively bumping KIST stuff to 0.2.7.x, since I think it won't be done this month. Please let me know if I'm wrong.

comment:6 Changed 5 years ago by robgjansen

After numerous bug fixes and a new Tor network model, I finally have some initial results. I tested nickm/kist at commit 55814effcb96ff4998e75a2136ac2ed631247d8a with UseKIST 0 and UseKIST 1.

The results totally blow. Download times increased dramatically when using the new feature, and Tor queue times were unchanged. I also noticed download failure modes around 5 minutes and 10 minutes, and am still looking into the cause.

None of this makes sense to me yet, and I have little confidence in the results I got. (Specifically, the queue times should have at least slightly changed, but I have not observed that.) My current thinking is that the problems are primarily due to the changes that occurred as a result of using a new network model. So my next step is to simulate under our older, stable model and go from there.

Changed 5 years ago by robgjansen

Attachment: shadow.perf.results.pdf added

Changed 5 years ago by robgjansen

Attachment: shadow.goodput.results.pdf added

comment:7 Changed 5 years ago by robgjansen

Update time.

I went back to the stable Tor network topology model that we used in the KIST paper in order to verify the performance issues I alluded to in my last post. The topology contained 3600 relays and 12000 clients. I ran nickm's kist branch merged with tor-0.2.6.2-alpha, in order to take advantage of the new TestingDirAuthVoteGuard and TestingDirAuthVoteExit options. I ran one experiment with UseKIST 1 and another with UseKIST 0.

The results are very similar to those obtained from my old topology and last set of experiments. The performance and throughput results are attached. As you can see, performance is worse when using the current KIST implementation than without it, and aggregate network throughput drops by almost half.

My current thinking is that we are starving the kernel and therefore not utilizing all available bandwidth of the relays, but more logging in the KIST branch would help give us some hard data about this potential problem.

Next I want to play around with KISTSockBufSizeFactor, so that we always write much more to the buffer than we think we need to. For example, I could set it extremely high to approximate the old behavior and make sure we avoid kernel starvation. I think that will give us a useful data point.
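
To make the role of the buffer-size factor concrete, here is a minimal Python sketch of a KIST-style per-socket write limit. The formula (free congestion-window space plus a factor-scaled allowance of extra kernel buffering) and every name in it (`SocketState`, `write_limit`, the field names) are assumptions for illustration, not code taken from the kist branch.

```python
from dataclasses import dataclass


@dataclass
class SocketState:
    """Hypothetical snapshot of kernel TCP state for one socket."""
    cwnd: int      # congestion window, in packets
    unacked: int   # packets sent but not yet acknowledged
    mss: int       # maximum segment size, in bytes
    notsent: int   # bytes queued in the kernel but not yet sent


def write_limit(sock: SocketState, buf_size_factor: float) -> int:
    """Return how many bytes to hand to the kernel for this socket.

    Assumed model: the bytes TCP could send right now, plus extra
    buffering proportional to the congestion window, minus what is
    already queued in the kernel.
    """
    tcp_space = max(sock.cwnd - sock.unacked, 0) * sock.mss
    extra_space = int(sock.cwnd * sock.mss * buf_size_factor) - sock.notsent
    return tcp_space + max(extra_space, 0)


sock = SocketState(cwnd=10, unacked=8, mss=1460, notsent=4000)
for factor in (0.5, 1.0, 100.0):
    print(factor, write_limit(sock, factor))
```

Under this model, a very large factor lets the limit grow far beyond what TCP can send immediately, which approximates the old behavior of filling the kernel buffer. A small factor keeps more data queued inside Tor, at the risk of starving the kernel if the estimate is too low.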

comment:8 Changed 5 years ago by nickm

Owner: robgjansen deleted
Status: new → assigned

comment:9 Changed 5 years ago by nickm

Keywords: 027-triaged-1-out added

Marking triaged-out items from first round of 0.2.7 triage.

comment:10 Changed 5 years ago by nickm

Milestone: Tor: 0.2.7.x-final → Tor: 0.2.???

Make all non-needs_review, non-needs_revision, 027-triaged-1-out items belong to 0.2.???

Changed 4 years ago by robgjansen

Attachment: shadowtorperf-ideal-model-from-ccs-peerflow.pdf added

Torperf vs Shadowperf under new Tor network model

comment:11 Changed 4 years ago by robgjansen

I believe the results I posted above are invalid. I recently found and fixed several bugs in Shadow which affected network performance, and created a more recent model of Tor that we were using for our peerflow experiments. I have higher confidence in this model after running many many experiments with it and analyzing results obtained with it.

I compared Torperf performance in Tor vs in Shadow with my new fancy model. Those results are attached here. It appears that Shadow is again tracking Tor performance nicely. (I believe the difference in time to first byte is because Karsten and I are starting our download timers at different points, which we just realized this week.)

Changed 4 years ago by robgjansen

Attachment: kist.stablemodel.shadow.results-sm.pdf added

comment:12 Changed 4 years ago by robgjansen

Update: using the model described in this comment, I ran KIST simulations using a variety of KISTSockBufSizeFactor settings (0.5, 1.5, 3.0) and compared the performance results against UseKIST 0 (vanilla Tor). The results are attached here. The high-level result is that there was an insignificant change in performance among all settings tested.

One possibility for the insignificant performance change is that the network is not congested enough for KIST to make a difference. To better understand this possibility, I'd like to run some cell tracking code that allows us to compute the Tor application and the Shadow kernel buffer times. We can then compare buffer times, and see how those change as we add load to the network (e.g., by doubling the number of clients).
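
The buffer-time comparison described above could be computed roughly as follows. This is only a sketch under assumptions: the per-cell record with three timestamps (`enqueued`, `written`, `sent`) is a hypothetical trace format, not something Tor or Shadow is known to emit in this form.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class CellTrace:
    """Hypothetical per-cell timestamps, in seconds."""
    enqueued: float  # cell entered a Tor circuit queue
    written: float   # cell was written to the kernel socket buffer
    sent: float      # cell left the kernel onto the network


def buffer_times(traces):
    """Return (median time in Tor, median time in the kernel)."""
    tor = [t.written - t.enqueued for t in traces]
    kernel = [t.sent - t.written for t in traces]
    return median(tor), median(kernel)
```

Running this once per experiment (UseKIST 0 versus UseKIST 1, at each load level) would give the two medians to compare. Since KIST aims to shift queuing from the kernel into Tor, a visible effect should show up as the kernel figure shrinking relative to the Tor figure.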

comment:13 Changed 4 years ago by nickm

Milestone: Tor: 0.2.??? → Tor: 0.2.8.x-final

comment:14 Changed 4 years ago by nickm

Points: large

comment:15 Changed 4 years ago by nickm

Milestone: Tor: 0.2.8.x-final → Tor: 0.2.???

It is impossible that we will fix all 252 currently open 028 tickets before 028 releases. Time to move some out. This is my first pass through the "assigned" tickets with no owner, looking for things to move to ???.

If somebody thinks they can get these done before the 0.2.8 timeout, please assign it to yourself and move it back?

comment:16 Changed 3 years ago by teor

Milestone: Tor: 0.2.??? → Tor: 0.3.???

Milestone renamed

comment:17 Changed 3 years ago by nickm

Keywords: tor-03-unspecified-201612 added
Milestone: Tor: 0.3.??? → Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:18 Changed 2 years ago by nickm

Keywords: tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:19 Changed 2 years ago by nickm

Keywords: 027-triaged-in added

comment:20 Changed 2 years ago by nickm

Keywords: 027-triaged-in removed

comment:21 Changed 2 years ago by nickm

Keywords: 027-triaged-1-out removed

comment:22 Changed 2 years ago by nickm

Status: assigned → new

Change the status of all assigned/accepted Tor tickets with owner="" to "new".

comment:23 Changed 2 years ago by nickm

Keywords: kist simulation analysis research-program added
Severity: Normal

comment:24 Changed 2 years ago by pastly

Resolution: fixed
Status: new → closed

As discussed in Wilmington, Shadow results look good. Will attach graphs after this comment.

The Shadow network used had 2000 relays, ~50k clients, and 5000 servers. Load was varied by adding/removing ~20k clients. Packet loss was varied among 0%, 1.5%, and 3%.

The TL;DR is the worse the network conditions, the better KIST does relative to vanilla Tor (AKA "AMAP" in the graphs).

Changed 2 years ago by pastly

Attachment: varied-load.png added

Changed 2 years ago by pastly

Attachment: varied-loss.png added