I simulated vanilla Tor 0.2.5.6-alpha as well as Roger's cmux-0256 branch using Shadow. I assume that the new global circuit scheduling approach is enabled by default in the cmux branch since I didn't notice any new config options related to it.
I'm not exactly sure how to validate that cmux is working correctly. I drew the performance graphs I typically use to understand how things are working at a high level. The results are attached here and here. (The two sets of graphs are drawn on the same data.)
If working correctly, the EWMA circuit scheduler should be doing a better job of de-prioritizing circuits as more and more bytes flow through them. The graphs seem to indicate that global scheduling improves latency (time to first byte), but most total download times have gotten a bit worse. Because of Shadow's client model, longer web download times mean fewer web downloads complete over the entire simulation. The graphs also show this.
I think more data analysis is a good idea to assert correctness and determine how global scheduling affects circuit EWMA values and throughput. I am requesting feedback about how to do that, and especially about how to push this task forward.
{{{
06:32 < nickm> athena: neat. I've been reading it and I hope Yawning has too
06:32 < nickm> athena: have you looked at rob's experimental results that he
asked about?
06:33 < nickm> (See #12889 (closed))
06:37 < nickm> I wonder what we should suggest that Rob try next
06:37 < nickm> And how we can find out if this is a bug, or as-intended, or what
06:39 < athena> the most interesting thing that comes to mind is varying the
thresholds
06:40 < athena> in particular, in the limit of very high thresholds the behavior
should converge to something like the old behavior, modulo maybe
a little higher latency for triggering the new mechanism through
libevent and all
06:41 < athena> if the gap persists even when the global high/low water marks are
set so high we start sending as soon as a circuit has anything
to send, we're basically scheduling one circuit at a time like
without the global scheduler
06:42 < Yawning> hmm
06:43 < nickm> there's also the possibility that something is going on we don't
expect. I wonder how we can figure out which.
06:43 < nickm> and/or confirm your hypotheses above
06:43 < Yawning> run the case athena just suggested and see if the behavior is
what we expect?
06:43 < nickm> hm. Plausible.
}}}
Thanks for posting these notes. I think a config option to vary the thresholds would be a great idea, and would help us learn more about the code than simply that it is functional. If such config option(s) existed, running more simulations that vary them would be a good way to continue here.
A new version with SchedulerLowWaterMark, SchedulerHighWaterMark, and SchedulerMaxFlushCells is available in my cmux_refactor_configurable_threshold branch.
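For reference, a torrc fragment exercising these options might look like the following. The option names come from the branch; the values are only illustrative examples, not recommendations.

```
## Illustrative values only -- tune per experiment.
SchedulerLowWaterMark 16384
SchedulerHighWaterMark 32768
SchedulerMaxFlushCells 16
```

Setting the watermarks very high should, per the IRC discussion above, make the behavior converge to something like the old per-circuit scheduling.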
I apologize for not posting an update sooner; I have been working slowly on this over the last few weeks. I merged andrea's cmux_refactor_configurable_threshold branch with 0.2.5.10 and ran into a mutex unlock bug that I have not yet had a chance to chase down. I have several experiments set up and ready to launch as soon as I fix this bug, which should happen soon now that I am finished traveling.
And I'll post updates more regularly, even if they don't indicate as much progress as I would like.
The error does not occur on 0.2.5.10, but does occur on 0.2.5.10 merged with the cmux_refactor_configurable_threshold branch.
I do not believe this is a bug in Tor, but rather a problem with the way Shadow's worker threads initialize OpenSSL. Basically, crypto_global_init and crypto_early_init should only be called once per Shadow worker thread, rather than once per Tor node. I thought I fixed this in this commit (and it would appear so, since 0.2.5.10 works fine), but apparently there are some differences in the way that is handled in Andrea's branch.
I believe the issue has been fixed on Shadow's end in this commit. I have simulations running now on Andrea's cmux_refactor_configurable_threshold branch.
My sanity-check experiments finished. In one experiment I ran 'vanilla' Tor with stable release 0.2.5.10, and for the other experiment I used the cmux_refactor_configurable_threshold branch and the following settings:
The results confirm that the new branch with the above settings results in performance very similar to vanilla Tor.
The network model included 400 relays and 1200 clients downloading files of the various sizes shown in the graphs. Keep in mind that in this smaller network, there will be some amount of variance in these experiments due to the different Tor builds being run. I could run this on an updated full ShadowTor network of 6000 relays when I finish producing one, but I don't think these results warrant doing so.
I believe that merging this branch and using settings similar to those above would not destroy performance. I'll next show some results of testing with different parameters.
(Second try, after I lost my first long and detailed explanation to trac.)
I ran a set of experiments varying the three parameters. The graphs showing the results are attached in parts due to file upload size limits. (I accidentally attached them to #12541 (moved).)
The results indicate that the combinations of settings I tested do not result in improved download times without decreasing network throughput. These findings are consistent with the testing of our global scheduling prototype that we performed for the KIST paper.
The issue here is that we are missing the kernel/tcp information (#12890 (moved)) to help us make intelligent decisions about which channels should get data and which ones should not. Without that information, the best approach seems to be the greedy one where the scheduler immediately sends as much as it can to each channel in an attempt to maximize throughput. Of course, this means the circuit scheduler is having little effect, and Tor will not be able to prioritize low EWMA circuits over high EWMA circuits correctly - the reason for doing this in the first place.
In our KIST work, we also found that global scheduling alone did not improve things dramatically. The real performance benefits were realized after doing BOTH the global scheduling AND the socket write limits parts of KIST (#12890 (moved)) - the approaches work hand-in-hand to intelligently set the high watermark for each channel. I expect similar results here.
We can either merge this branch and use the settings that result in the old behavior until #12890 (moved) is completed, or we can wait until #12890 (moved) is completed on top of this branch and we have more simulation results.
My inclination is that we're done here. I think merging #9262 (moved) with amended settings to get the old behavior, and then using it as the basis for #12890 (moved), could be sensible. (Does it make a good basis for #12890 (moved)?)
Yes it does make a good basis and I believe the global scheduler will be needed for #12890 (moved) to work best. However, while #12890 (moved) could be designed to adjust the high watermark setting from this branch, it could also be designed independently of this branch if we wanted to test each feature separately. I think the former would be less work.