Opened 5 years ago

Closed 5 years ago

#12889 closed task (fixed)

Simulate global circuit scheduling from #9262

Reported by: robgjansen Owned by: robgjansen
Priority: Medium Milestone:
Component: Archived/general Version:
Severity: Keywords: Shadow
Cc: nikita Actual Points:
Parent ID: #12541 Points:
Reviewer: Sponsor:


Simulate #9262 in Shadow to test its effect on performance.

Attachments (3)

shadowtor-400r-perf1.pdf (311.5 KB) - added by robgjansen 5 years ago.
shadowtor-400r-perf2.pdf (203.2 KB) - added by robgjansen 5 years ago.
cmux-sanity.shadowtor.pdf.xz (2.2 MB) - added by robgjansen 5 years ago.

Change History (20)

Changed 5 years ago by robgjansen

Attachment: shadowtor-400r-perf1.pdf added

Changed 5 years ago by robgjansen

Attachment: shadowtor-400r-perf2.pdf added

comment:1 Changed 5 years ago by robgjansen

I simulated vanilla Tor as well as Roger's cmux-0256 branch using Shadow. I assume that the new global circuit scheduling approach is enabled by default in the cmux branch since I didn't notice any new config options related to it.

I'm not exactly sure how to validate that cmux is working correctly. I drew the performance graphs I typically use to understand how things are working at a high level. The results are attached here and here. (The two sets of graphs are drawn on the same data.)

If working correctly, then the EWMA circuit scheduler should be doing a better job of de-prioritizing circuits as more and more bytes flow through them. The graphs seem to indicate that global scheduling improves latency (time to first byte), but most total download times have gotten a bit worse. As a result of Shadow's client model, longer web download times mean that fewer web downloads will complete over the entire simulation. The graphs also show this.

I think more data analysis is a good idea to assert correctness and determine how global scheduling affects circuit EWMA values and throughput. I am requesting feedback about how to do that, and especially about how to push this task forward.

comment:2 Changed 5 years ago by nikita

Cc: nikita added

comment:3 Changed 5 years ago by andrea

Discussion in #tor-dev:

06:32 < nickm> athena: neat. I've been reading it and I hope Yawning has too
06:32 < nickm> athena: have you looked at rob's experimental results that he
               asked about?
06:33 < nickm> (See #12889)
06:37 < nickm> I wonder what we should suggest that Rob try next
06:37 < nickm> And how we can find out if this is a bug, or as-intended, or what
06:39 < athena> the most interesting thing that comes to mind is varying the
06:40 < athena> in particular, in the limit of very high thresholds the behavior
                should converge to something like the old behavior, modulo maybe
                a little higher latency for triggering the new mechanism through
                libevent and all
06:41 < athena> if the gap persists even when the global high/low water marks are
                set so high we start sending as soon as a circuit has anything
                to send, we're basically scheduling one circuit at a time like
                without the global scheduler
06:42 < Yawning> hmm
06:43 < nickm> there's also the possibility that something is going on we don't
               expect.  I wonder how we can figure out which.
06:43 < nickm> and/or confirm your hypotheses above
06:43 < Yawning> run the case athena just suggested and see if the behavior is
                 what we expect?
06:43 < nickm> hm.  Plausible.
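
The high/low water mark behavior discussed above can be illustrated with a small hysteresis gate. This is only a sketch under assumptions: `WatermarkGate` and `may_flush` are hypothetical names, and the logic is a generic water-mark pattern, not the scheduler code in the cmux branch.

```python
class WatermarkGate:
    """Hysteresis on queued bytes (illustrative; not Tor's scheduler code)."""

    def __init__(self, low_bytes: int, high_bytes: int) -> None:
        if low_bytes > high_bytes:
            raise ValueError("low water mark must not exceed high water mark")
        self.low_bytes = low_bytes
        self.high_bytes = high_bytes
        self.flushing = True  # start out allowed to flush

    def may_flush(self, queued_bytes: int) -> bool:
        """Return True if more cells may be flushed at this queue depth."""
        if self.flushing and queued_bytes >= self.high_bytes:
            # Reached the high mark: stop until the queue drains.
            self.flushing = False
        elif not self.flushing and queued_bytes < self.low_bytes:
            # Drained below the low mark: resume flushing.
            self.flushing = True
        return self.flushing
```

With both marks set far above anything the queue ever reaches, `may_flush` always returns True, which corresponds to the limiting case athena describes where cells are sent as soon as a circuit has anything to send.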

comment:4 Changed 5 years ago by andrea

If we end up wanting to experiment with the thresholds a lot, it might be useful to turn them into a config option.

comment:5 Changed 5 years ago by robgjansen

Thanks for posting these notes. I think a config option to vary the threshold would be a great idea, and would help us learn more about the code than that it is functional. If config option(s) existed, running more simulations that vary the configs would be a good place to continue here.

comment:6 Changed 5 years ago by andrea

New version with SchedulerLowWaterMark, SchedulerHighWaterMark and SchedulerMaxFlushCells is available in my cmux_refactor_configurable_threshold branch.

comment:7 Changed 5 years ago by nickm

Rob, any results from testing that?

Anybody want to try this on a live relay, possibly an exit? I'd like to know more about what happens.

comment:8 Changed 5 years ago by robgjansen

I apologize for not posting an update sooner, but I have been working slowly on this over the last few weeks. I have merged andrea's cmux_refactor_configurable_threshold branch with and have run into a mutex unlock bug that I have not yet had a chance to chase down. I have several experiments set up and ready to launch as soon as I fix this bug, which should happen rsn now that I am finished traveling.

And I'll post updates more regularly, even if they don't indicate as much progress as I would like.

comment:9 Changed 5 years ago by robgjansen

To be more specific about the bug, the Tor error message is this:

[thread-9] 00:01:15:000000000 [scallion-error] [relayexitguard11-] [scalliontor_logmsg_cb] Error 1 unlocking a mutex.

The backtrace is here:

Obtained 30 stack frames:
        /home/rob/.shadow/bin/shadow() [0x43c3fc]
        /home/rob/.shadow/bin/shadow(utility_handleError+0x34) [0x43be74]
        /home/rob/.shadow/bin/shadow(logging_handleLog+0x1df) [0x41263f]
        /lib64/ [0x3596450429]
        /home/rob/.shadow/bin/shadow(logging_logv+0x41c) [0x412a6c]
        /home/rob/.shadow/bin/shadow() [0x426ada]
        /tmp/ [0x7ffe79b22301]
        /tmp/ [0x7ffe79c820f3]
        /tmp/ [0x7ffe79c804ad]
        /tmp/ [0x7ffe79c7af7b]
        /tmp/ [0x7ffe79c75df0]
        /home/rob/.shadow/lib/ [0x7ffe8a27a36c]
        /home/rob/.shadow/lib/ [0x7ffe8a27a8db]
        /home/rob/.shadow/lib/ [0x7ffe8a27c379]
        /tmp/ [0x7ffe795faba5]
        /home/rob/.shadow/lib/ [0x7ffe8b8ea366]
        /tmp/ [0x7ffe795c9e8e]
        /tmp/ [0x7ffe794ae68d]
        /tmp/ [0x7ffe794ae9b7]
        /tmp/ [0x7ffe794adbfe]
        /home/rob/.shadow/bin/shadow(thread_executeNew+0xd8) [0x427048]
        /home/rob/.shadow/bin/shadow(process_start+0x194) [0x426184]
        /home/rob/.shadow/bin/shadow(host_startApplication+0x64) [0x428644]
        /home/rob/.shadow/bin/shadow(startapplication_run+0x8d) [0x433bcd]
        /home/rob/.shadow/bin/shadow(shadowevent_run+0x167) [0x432c07]
        /home/rob/.shadow/bin/shadow() [0x40ff7e]
        /home/rob/.shadow/bin/shadow(worker_runParallel+0xcf) [0x40fd2f]
        /lib64/ [0x359646ea45]
        /lib64/ [0x3593c07ee5]
        /lib64/ [0x35938f4b8d]

The error does not occur on, but does occur on merged with the cmux_refactor_configurable_threshold branch.

I do not believe this is Tor's bug, but a problem with the way Shadow's worker threads initialize OpenSSL. Basically, crypto_global_init and crypto_early_init should be called only once per Shadow worker thread, rather than once per Tor node. I *thought* I fixed this in this commit (and it would appear so since works fine), but apparently there are some differences in the way that is handled in Andrea's branch.
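
The once-per-worker-thread idea can be sketched with thread-local state. This is a hedged illustration, not Shadow's or Tor's code: `global_init_stub` stands in for the crypto initialization calls, and `ensure_thread_initialized`, `run_node`, and `simulate` are hypothetical names.

```python
import threading

_thread_state = threading.local()   # per-thread "already initialized" flag
_calls_lock = threading.Lock()
init_calls = []                      # one entry per real initialization

def global_init_stub() -> None:
    """Stand-in for the expensive, thread-scoped initialization."""
    with _calls_lock:
        init_calls.append(threading.get_ident())

def ensure_thread_initialized() -> None:
    """Run the init at most once in the calling thread."""
    if not getattr(_thread_state, "initialized", False):
        global_init_stub()
        _thread_state.initialized = True

def run_node(node_name: str) -> str:
    # Every simulated node asks for init, but only the first call
    # in each worker thread actually performs it.
    ensure_thread_initialized()
    return node_name

def simulate(num_workers: int, nodes_per_worker: int) -> None:
    def worker(index: int) -> None:
        for i in range(nodes_per_worker):
            run_node(f"relay{index}-{i}")

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Running `simulate(3, 5)` hosts fifteen nodes on three threads but records only three initializations, one per worker.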

comment:10 Changed 5 years ago by robgjansen

I believe the issue has been fixed on Shadow's end in this commit. I have simulations running now on Andrea's cmux_refactor_configurable_threshold branch.

Changed 5 years ago by robgjansen

Attachment: cmux-sanity.shadowtor.pdf.xz added

comment:11 Changed 5 years ago by robgjansen

My sanity-check experiments finished. In one experiment I ran 'vanilla' Tor with stable release, and for the other experiment I used the cmux_refactor_configurable_threshold branch and the following settings:

SchedulerLowWaterMark 100MB
SchedulerHighWaterMark 101MB
SchedulerMaxFlushCells 1000

The results confirm that the new branch with the above settings results in performance very similar to vanilla Tor.

The network model included 400 relays and 1200 clients downloading files of the various sizes shown in the graphs. Keep in mind that in this smaller network, there will be some amount of variance in these experiments due to the different Tor builds being run. I could run this on an updated full ShadowTor network of 6000 relays when I finish producing one, but I don't think these results warrant doing so.

I believe that merging this branch and using settings similar to those above would not destroy performance. I'll next show some results of testing with different parameters.

comment:12 Changed 5 years ago by robgjansen

(Second try, after I lost my first long and detailed explanation to trac.)

I ran a set of experiments varying the three parameters. The graphs showing the results are attached in parts due to file upload size limits. (I accidentally attached them to #12541.)

highwater: part0 part1
lowwater: part0 part1
maxflush: part0 part1

I compressed and split the PDF files like

xz cmux-highwater.shadowtor.pdf
split -b 2500K -a 1 -d cmux-highwater.shadowtor.pdf.xz cmux-highwater.shadowtor.pdf.xz-part

They can be reconstructed like

cat cmux-highwater.shadowtor.pdf.xz-part0 cmux-highwater.shadowtor.pdf.xz-part1 > cmux-highwater.shadowtor.pdf.xz
xz -d cmux-highwater.shadowtor.pdf.xz
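
The same compress, split, concatenate, and decompress round trip can be checked in memory with Python's standard `lzma` module. This is a minimal sketch; the function names and the chunk size are illustrative assumptions, not part of the workflow used for the attachments.

```python
import lzma

def split_bytes(data: bytes, chunk_size: int) -> list:
    """Cut data into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def compress_and_split(payload: bytes, chunk_size: int) -> list:
    """Compress to the .xz container format, then split into parts."""
    compressed = lzma.compress(payload, format=lzma.FORMAT_XZ)
    return split_bytes(compressed, chunk_size)

def join_and_decompress(parts: list) -> bytes:
    """Concatenate the parts in order and decompress the result."""
    return lzma.decompress(b"".join(parts), format=lzma.FORMAT_XZ)
```

The parts must be joined in their original order; a reordered or missing part makes decompression fail rather than silently produce a different file.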

The results indicate that the combinations of settings that I tested do not result in improved download times without decreasing network throughput. These findings are consistent with the testing of our global scheduling prototype that we performed for the KIST paper.

The issue here is that we are missing the kernel/tcp information (#12890) to help us make intelligent decisions about which channels should get data and which ones should not. Without that information, the best approach seems to be the greedy one where the scheduler immediately sends as much as it can to each channel in an attempt to maximize throughput. Of course, this means the circuit scheduler is having little effect, and Tor will not be able to prioritize low EWMA circuits over high EWMA circuits correctly - the reason for doing this in the first place.
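
For readers unfamiliar with EWMA-based prioritization, the idea of favoring low-EWMA circuits can be sketched as follows. This is an illustration under assumptions, not Tor's implementation: `CircuitEwma`, `pick_next`, and the `halflife` parameter are hypothetical names, and time is in arbitrary units.

```python
class CircuitEwma:
    """Exponentially decaying count of cells sent on one circuit."""

    def __init__(self, name: str, halflife: float) -> None:
        self.name = name
        self.halflife = halflife   # time for the count to decay by half
        self.value = 0.0
        self.last_update = 0.0

    def current(self, now: float) -> float:
        """Decayed cell count as of time `now`."""
        elapsed = now - self.last_update
        return self.value * 0.5 ** (elapsed / self.halflife)

    def record_cells(self, cells: int, now: float) -> None:
        """Decay the old count to `now`, then add the new cells."""
        self.value = self.current(now) + cells
        self.last_update = now

def pick_next(circuits, now: float):
    """Choose the circuit with the lowest decayed count (quietest first)."""
    return min(circuits, key=lambda c: c.current(now))
```

A circuit that recently sent many cells (a bulk download) has a high value and is picked after a quiet interactive circuit. This ordering only matters when several circuits are waiting at the moment the scheduler chooses, which is why a greedy flush policy leaves it with little effect.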

In our KIST work, we also found that global scheduling alone did not improve things dramatically. The real performance benefits were realized after doing BOTH the global scheduling AND the socket write limits parts of KIST (#12890) - the approaches work hand-in-hand to intelligently set the high watermark for each channel. I expect similar results here.

We can either merge this branch and use the settings that result in the old behavior until #12890 is completed, or we can wait until #12890 is completed on top of this branch and we have more simulation results.

comment:13 Changed 5 years ago by robgjansen

I think this ticket is done and we can move on to #12890. Is anything else wanted here?

comment:14 Changed 5 years ago by nickm

My inclination is that we're done here. I think merging #9262 with amended settings to get the old behavior, and then using it as the basis for #12890, could be sensible. (Does it make a good basis for #12890?)

comment:15 in reply to:  14 Changed 5 years ago by robgjansen

Replying to nickm:

(Does it make a good basis for #12890?)

Yes it does make a good basis and I believe the global scheduler will be needed for #12890 to work best. However, while #12890 could be designed to adjust the high watermark setting from this branch, it could also be designed independently of this branch if we wanted to test each feature separately. I think the former would be less work.

comment:16 Changed 5 years ago by nickm_mobile

Okay. That matches my judgment too. Let's do #12890 on top of tor+this.

comment:17 Changed 5 years ago by robgjansen

Resolution: fixed
Status: new → closed

Closing this out as I do not intend to do more simulation.
