I simulated vanilla Tor 0.2.5.6-alpha as well as Roger's cmux-0256 branch using Shadow. I assume that the new global circuit scheduling approach is enabled by default in the cmux branch since I didn't notice any new config options related to it.
I'm not exactly sure how to validate that cmux is working correctly. I drew the performance graphs I typically use to understand how things are working at a high level. The results are attached here and here. (The two sets of graphs are drawn on the same data.)
If working correctly, the EWMA circuit scheduler should be doing a better job of de-prioritizing circuits as more and more bytes flow through them. The graphs seem to indicate that global scheduling improves latency (time to first byte), but most total download times have gotten a bit worse. Because of Shadow's client model, longer web download times mean fewer web downloads complete over the entire simulation. The graphs also show this.
I think more data analysis is a good idea to assert correctness and determine how global scheduling affects circuit EWMA values and throughput. I am requesting feedback about how to do that, and especially about how to push this task forward.
{{{
06:32 < nickm> athena: neat. I've been reading it and I hope Yawning has too
06:32 < nickm> athena: have you looked at rob's experimental results that he
asked about?
06:33 < nickm> (See #12889 (closed))
06:37 < nickm> I wonder what we should suggest that Rob try next
06:37 < nickm> And how we can find out if this is a bug, or as-intended, or what
06:39 < athena> the most interesting thing that comes to mind is varying the
thresholds
06:40 < athena> in particular, in the limit of very high thresholds the behavior
should converge to something like the old behavior, modulo maybe
a little higher latency for triggering the new mechanism through
libevent and all
06:41 < athena> if the gap persists even when the global high/low water marks are
set so high we start sending as soon as a circuit has anything
to send, we're basically scheduling one circuit at a time like
without the global scheduler
06:42 < Yawning> hmm
06:43 < nickm> there's also the possibility that something is going on we don't
expect. I wonder how we can figure out which.
06:43 < nickm> and/or confirm your hypotheses above
06:43 < Yawning> run the case athena just suggested and see if the behavior is
what we expect?
06:43 < nickm> hm. Plausible.
}}}
Thanks for posting these notes. I think a config option to vary the thresholds would be a great idea, and would help us learn more about the code than simply that it is functional. If such config option(s) existed, running more simulations that vary them would be a good way to continue here.
A new version with SchedulerLowWaterMark, SchedulerHighWaterMark, and SchedulerMaxFlushCells is available in my cmux_refactor_configurable_threshold branch.
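For reference, a torrc fragment exercising these options might look like the following. The option names come from the branch; the values are only illustrative examples, not recommendations.

```
## Illustrative values only -- tune per experiment.
SchedulerLowWaterMark 16384
SchedulerHighWaterMark 32768
SchedulerMaxFlushCells 16
```

Setting the watermarks very high should, per the IRC discussion above, make the behavior converge to something like the old per-circuit scheduling.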
I apologize for not posting an update sooner; I have been working slowly on this over the last few weeks. I merged andrea's cmux_refactor_configurable_threshold branch with 0.2.5.10 and ran into a mutex unlock bug that I have not yet had a chance to chase down. I have several experiments set up and ready to launch as soon as I fix this bug, which should happen soon now that I am finished traveling.
And I'll post updates more regularly, even if they don't indicate as much progress as I would like.
The error does not occur on 0.2.5.10, but does occur on 0.2.5.10 merged with the cmux_refactor_configurable_threshold branch.
I do not believe this is a bug in Tor, but rather a problem with the way Shadow's worker threads initialize OpenSSL. Basically, crypto_global_init and crypto_early_init should only be called once per Shadow worker thread, rather than once per Tor node. I thought I fixed this in this commit (and it would appear so, since 0.2.5.10 works fine), but apparently there are some differences in the way that is handled in Andrea's branch.
I believe the issue has been fixed on Shadow's end in this commit. I have simulations running now on Andrea's cmux_refactor_configurable_threshold branch.
My sanity-check experiments finished. In one experiment I ran 'vanilla' Tor with stable release 0.2.5.10, and for the other experiment I used the cmux_refactor_configurable_threshold branch and the following settings:
The results confirm that the new branch with the above settings results in performance very similar to vanilla Tor.
The network model included 400 relays and 1200 clients downloading files of the various sizes shown in the graphs. Keep in mind that in this smaller network, there will be some amount of variance in these experiments due to the different Tor builds being run. I could run this on an updated full ShadowTor network of 6000 relays when I finish producing one, but I don't think these results warrant doing so.
I believe that merging this branch and using settings similar to those above would not destroy performance. I'll next show some results of testing with different parameters.
(Second try, after I lost my first long and detailed explanation to trac.)
I ran a set of experiments varying the three parameters. The graphs showing the results are attached in parts due to file upload size limits. (I accidentally attached them to #12541 (moved).)
The results indicate that the combinations of settings I tested do not result in improved download times without decreasing network throughput. These findings are consistent with the testing of our global scheduling prototype that we performed for the KIST paper.
The issue here is that we are missing the kernel/tcp information (#12890 (moved)) to help us make intelligent decisions about which channels should get data and which ones should not. Without that information, the best approach seems to be the greedy one where the scheduler immediately sends as much as it can to each channel in an attempt to maximize throughput. Of course, this means the circuit scheduler is having little effect, and Tor will not be able to prioritize low EWMA circuits over high EWMA circuits correctly - the reason for doing this in the first place.
In our KIST work, we also found that global scheduling alone did not improve things dramatically. The real performance benefits were realized after doing BOTH the global scheduling AND the socket write limits parts of KIST (#12890 (moved)) - the approaches work hand-in-hand to intelligently set the high watermark for each channel. I expect similar results here.
We can either merge this branch and use the settings that result in the old behavior until #12890 (moved) is completed, or we can wait until #12890 (moved) is completed on top of this branch and we have more simulation results.
My inclination is that we're done here. I think merging #9262 (moved) with amended settings to get the old behavior, and then using it as the basis for #12890 (moved), could be sensible. (Does it make a good basis for #12890 (moved)?)
Yes it does make a good basis and I believe the global scheduler will be needed for #12890 (moved) to work best. However, while #12890 (moved) could be designed to adjust the high watermark setting from this branch, it could also be designed independently of this branch if we wanted to test each feature separately. I think the former would be less work.