Simulate #9262 and #12890 in Shadow to test KIST's effect on performance. Come up with good default parameters that work well to reduce congestion/latency.
After numerous bug fixes and a new Tor network model, I finally have some initial results. I tested nickm/kist at commit 55814effcb96ff4998e75a2136ac2ed631247d8a with UseKIST 0 and with UseKIST 1.
The results totally blow. Download times increased dramatically when using the new feature, while Tor queue times were unchanged. I also noticed download failure modes around 5 minutes and 10 minutes, and am still looking into the cause.
None of this makes sense to me yet, and I have little confidence in these results. (In particular, the queue times should have changed at least slightly, but I did not observe that.) My current thinking is that the problems stem primarily from the switch to the new network model. So my next step is to simulate under our older, stable model and go from there.
I went back to the stable Tor network topology model that we used in the KIST paper in order to verify the performance issues I alluded to in my last post. The topology contained 3600 relays and 12000 clients. I ran nickm's kist branch merged with tor-0.2.6.2-alpha, in order to take advantage of the new TestingDirAuthVoteGuard and TestingDirAuthVoteExit options. I ran one experiment with UseKIST 1 and another with UseKIST 0.
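For reference, the setup looked roughly like the torrc fragments below. This is a sketch rather than the exact experiment configs; the relay nicknames are placeholders, and TestingTorNetwork is shown because the TestingDirAuthVote* options are only honored in a testing network.

```
## torrc sketch (illustrative, not the exact experiment configs)

## On the directory authorities: force the Exit/Guard flags onto the
## listed relays (nicknames are placeholders) so the simulated
## consensus matches the intended topology.
TestingTorNetwork 1
TestingDirAuthVoteExit exit1,exit2,exit3
TestingDirAuthVoteGuard guard1,guard2,guard3

## On the relays: the only difference between the two runs
## (0 for the control experiment).
UseKIST 1
```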
The results are very similar to those obtained from my old topology and last set of experiments. The performance and throughput results are attached. As you can see, performance is worse when using the current KIST implementation than without it, and aggregate network throughput drops by almost half.
My current thinking is that we are starving the kernel and therefore not utilizing all of the relays' available bandwidth; more logging in the KIST branch would give us some hard data about this potential problem.
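To make the starvation hypothesis concrete: a KIST-style scheduler computes a per-socket write limit from kernel state, and if that limit is too conservative, the kernel send buffer can drain to empty between scheduling rounds. Below is a rough sketch of that computation on Linux. It is my own illustration, not code from nickm's branch; the `kist_write_limit` helper and the use of `SIOCOUTQNSD` are assumptions.

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/sockios.h> /* SIOCOUTQNSD */

/* Illustrative sketch of a KIST-style per-socket write limit.
 * Returns the number of bytes Tor should write now, or -1 on error. */
static long
kist_write_limit(int fd, double sock_buf_size_factor)
{
  struct tcp_info ti;
  socklen_t ti_len = sizeof(ti);
  int sndbuf = 0, notsent = 0;
  socklen_t sb_len = sizeof(sndbuf);
  long tcp_space, buf_space;

  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &ti_len) < 0)
    return -1;
  /* Bytes TCP could put on the wire right now: the free part of the
   * congestion window (in segments) times the segment size. */
  tcp_space = ((long)ti.tcpi_snd_cwnd - (long)ti.tcpi_unacked)
              * (long)ti.tcpi_snd_mss;
  if (tcp_space < 0)
    tcp_space = 0;

  /* Free space in the (scaled) kernel send buffer: capacity times
   * KISTSockBufSizeFactor, minus bytes queued but not yet sent. */
  if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &sb_len) < 0)
    return -1;
  if (ioctl(fd, SIOCOUTQNSD, &notsent) < 0)
    return -1;
  buf_space = (long)(sndbuf * sock_buf_size_factor) - notsent;
  if (buf_space < 0)
    buf_space = 0;

  /* Write no more than TCP can send and the buffer can hold. */
  return tcp_space < buf_space ? tcp_space : buf_space;
}
```

Logging `tcp_space`, `buf_space`, and the actual bytes written each round, and flagging rounds where the limit computes to 0 while the send queue is empty, would give us the hard data mentioned above.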
Next I want to play around with KISTSockBufSizeFactor, so that we always write much more to the buffer than we think we need to. For example, I could set it extremely high to approximate the old behavior and make sure we avoid kernel starvation. I think that will give us a useful data point.
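Concretely, the extreme-factor run might look like this (the 10.0 value is an arbitrary illustration, not a tested setting):

```
## Hypothetical extreme setting: a factor this large should make the
## buffer-space limit never bind, approximating vanilla Tor's write
## behavior and ruling kernel starvation in or out.
UseKIST 1
KISTSockBufSizeFactor 10.0
```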
I believe the results I posted in comment 7 are invalid. I recently found and fixed several bugs in Shadow that affected network performance, and I created a more recent model of Tor that we have been using for our PeerFlow experiments. I have higher confidence in this model after running many, many experiments with it and analyzing the results it produced.
I compared Torperf performance in the live Tor network against Shadow with my new fancy model. Those results are attached here. It appears that Shadow is again tracking Tor performance nicely. (I believe the difference in time to first byte is because Karsten and I start our download timers at different points, which we just realized this week.)
Update: using the model described in this comment, I ran KIST simulations with a variety of KISTSockBufSizeFactor settings (0.5, 1.5, 3.0) and compared the performance results against UseKIST 0 (vanilla Tor). The results are attached here. The high-level result is that there was no significant change in performance across any of the settings tested.
One possible explanation for the insignificant performance change is that the network is not congested enough for KIST to make a difference. To better understand this possibility, I'd like to run some cell-tracking code that lets us compute the time cells spend in the Tor application buffers and in the Shadow kernel buffers. We can then compare those buffer times and see how they change as we add load to the network (e.g., by doubling the number of clients).
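As a sketch of what I have in mind (the log format is hypothetical; the idea is one record per cell per buffer, with enqueue and dequeue timestamps):

```c
#include <stdio.h>
#include <string.h>

/* Reads hypothetical cell-tracking records from stdin, one per line:
 *   <cell_id> <layer> <enqueue_usec> <dequeue_usec>
 * where <layer> is "app" (Tor's circuit/connection queues) or
 * "kernel" (Shadow's simulated socket buffers), and prints the mean
 * residence time in each layer. */
int
main(void)
{
  long id;
  char layer[16];
  double enq, deq, sum[2] = {0.0, 0.0};
  long n[2] = {0, 0};

  while (scanf("%ld %15s %lf %lf", &id, layer, &enq, &deq) == 4) {
    int k = (strcmp(layer, "kernel") == 0) ? 1 : 0;
    sum[k] += deq - enq;
    n[k]++;
  }
  if (n[0])
    printf("mean app buffer time:    %.1f usec (%ld cells)\n",
           sum[0] / n[0], n[0]);
  if (n[1])
    printf("mean kernel buffer time: %.1f usec (%ld cells)\n",
           sum[1] / n[1], n[1]);
  return 0;
}
```

Comparing the two means at 1x and 2x client load should tell us where cells spend their time: if queueing only builds up in Tor's application buffers under heavy load, that would support the not-congested-enough explanation.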