What knobs should we expose to be able to tune it in practice? What are the recommended initial guesses for those knobs, and how sensitive are our expected performance results to changes in the values, or changes in network load?
Ok, the n23-2 branch in my git repo (git://git.torproject.org/arma/tor) is the patch we have to work with (see #4488 (moved) for how we got it). You can run it with "UseN23 1" to turn on the new behavior. If you leave UseN23 off, it should act like a normal Tor.
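For reference, the only torrc change needed for the test runs should be this one line (everything else stays as in the existing simulation torrcs); leaving it out, or setting it to 0, gives the vanilla behavior:

UseN23 1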
Completed download counts may give us a sense of load on the network.
vanilla:
  24148 x 320KiB (web)
  74 x 5MiB (bulk)
n23:
  28767 x 320KiB (web)
  237 x 5MiB (bulk)
I added the "window size hack" to my configuration, but ended up using the git version where that was presumably not needed. Would this cause any problems?
> Completed download counts may give us a sense of load on the network.
>
> vanilla:
>   24148 x 320KiB (web)
>   74 x 5MiB (bulk)
> n23:
>   28767 x 320KiB (web)
>   237 x 5MiB (bulk)
So in this case the web downloads were a little bit slower, and the bulk downloads were a lot faster. And just as important, more of the downloads (especially bulk downloads) failed outright in the vanilla case. And most important of all, the code actually works.
What do you count as a failed download? Does your downloader cut off the attempt after 50 or 80 seconds or something? Do we have any sense of whether these failed downloads got 0 bytes or 'most of the bytes' before failing?
Mashael said, before seeing your results, "If N3Initial is small (50-70), it can throttle bulk downloaders and improve performance for browsers." So that's the clear next experiment: run the n23 side with "N3Initial 50". This means each circuit will have fewer cells in flight between relays (more accurately, sitting in the outbuf) at any given time.
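In case it helps to see the mechanism, here is a rough conceptual sketch of the N23 credit idea (an illustration in C, not the code from the branch; all names are made up): a relay may only have N2+N3 cells outstanding toward the next hop, and the next hop returns a flow-control cell after every N2 cells it flushes, so lowering N3Initial directly shrinks how many cells can sit in the outbuf per circuit.

```c
/* Illustration only -- not the n23 branch code; names are invented. */
typedef struct {
  int n2;           /* cells between flow-control (credit) cells */
  int n3;           /* extra slack on top of n2 (what N3Initial would tune) */
  int cells_sent;   /* cells forwarded downstream on this circuit */
  int cells_acked;  /* cells covered by credit cells received so far */
} n23_state_t;

/* A cell may be forwarded only while fewer than n2+n3 cells are
 * unacknowledged, i.e. sitting in buffers toward the next hop. */
static int n23_can_send(const n23_state_t *s) {
  return (s->cells_sent - s->cells_acked) < (s->n2 + s->n3);
}

static void n23_note_cell_sent(n23_state_t *s) {
  s->cells_sent++;
}

/* Called when a credit cell arrives from the next hop, reporting how
 * many cells it has flushed since its last credit cell (normally n2). */
static void n23_note_credit(n23_state_t *s, int flushed) {
  s->cells_acked += flushed;
}
```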
(I also just noticed that ewma is off unless explicitly turned on. We should continue leaving it off for now, but we should keep in mind that at some point in the future we'll want to compare with 'on'.)
> I added the "window size hack" to my configuration, but ended up using the git version where that was presumably not needed. Would this cause any problems?

> What do you count as a failed download? Does your downloader cut off the attempt after 50 or 80 seconds or something? Do we have any sense of whether these failed downloads got 0 bytes or 'most of the bytes' before failing?
The downloaders do not "give up" unless the socks connection is closed by the other end. Although, I now realize it's probably smart to have the client "give up" so that a hung download doesn't knock it out of the simulation. Doesn't Tor kill the AP connection if it can't get a circuit by a deadline?
I do not explicitly count failed downloads, but I do track bytes received over time. So we could count partial downloads that never completed, and infer failed downloads as partials that were started before the last X minutes of the simulation (to avoid counting legitimate in-progress downloads as failed).
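A minimal sketch of that inference rule, with hypothetical types (the real analysis would reconstruct these records from the bytes-over-time logs):

```c
#include <stddef.h>

/* Hypothetical per-download record reconstructed from the logs;
 * field names are made up for illustration. */
typedef struct {
  double start_time;  /* seconds since the start of the simulation */
  int completed;      /* nonzero if the download finished */
} download_t;

/* Treat a download as failed if it never completed and it started
 * before the last grace_secs of the simulation, i.e. it plausibly had
 * enough time to finish and is not just a legitimate in-progress one. */
static int count_inferred_failures(const download_t *dl, size_t n,
                                   double sim_end, double grace_secs) {
  int failed = 0;
  for (size_t i = 0; i < n; i++) {
    if (!dl[i].completed && dl[i].start_time < sim_end - grace_secs)
      failed++;
  }
  return failed;
}
```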
> the clear next experiment: run the n23 side with "N3Initial 50".
> The downloaders do not "give up" unless the socks connection is closed by the other end. Although, I now realize it's probably smart to have the client "give up" so that a hung download doesn't knock it out of the simulation. Doesn't Tor kill the AP connection if it can't get a circuit by a deadline?
Yes. But if it gets a circuit and starts getting bytes, it will never stop until it gets an 'end' cell from the other side. From the Tor client's perspective, it's got a TCP connection open and the other side has chosen not to send data lately, which is fine.
So if a significant fraction of the clients are stalling, that's either a Tor bug (e.g. "for some reason the end cell gets lost") or a simulation bug (no example guesses yet there). But I guess if the client is no longer adding load to the network, it's not necessarily the case that lots of the clients are stalling -- it could be a few of them right at the beginning, and then they just disappear from the simulation after that. That case would especially influence the rest of the simulation since the client load has changed.
> So if a significant fraction of the clients are stalling, that's either a Tor bug (e.g. "for some reason the end cell gets lost") or a simulation bug (no example guesses yet there). But I guess if the client is no longer adding load to the network, it's not necessarily the case that lots of the clients are stalling -- it could be a few of them right at the beginning, and then they just disappear from the simulation after that. That case would especially influence the rest of the simulation since the client load has changed.
It looks like there is some throttling going on, since TTFB (time to first byte) improves for both bulk and web. While bulk downloaders seem to take the majority of the punishment, unfortunately a big chunk of web downloaders take some as well.
We are again seeing problems with the clients, this time with the P2P clients. We should really focus on getting things right in #6341 (moved) before running more experiments here, since that patch is much simpler than this.
We should definitely re-run this, but I'm posting the current results anyway in case they are useful.
If it would be useful to have the N3Initial 50 tests work out and you can get me stack traces, I can try to figure out what the issue is.
src/or/connection_or.c: In function ‘connection_or_send_destroy’:
src/or/connection_or.c:1889:14: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
One of the configs is to use 'N3Initial 50'. I noticed that N3Min is 100 by default in config.c. Do we need to change N3Min if we set N3Initial below it?
Exciting. If you change the assert to an if (!conn) return, does it get farther?
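Something along these lines, assuming the assert that fires is the one on the conn pointer (the exact location in the branch isn't quoted here):

```c
/* Hypothetical illustration of the suggested change -- the actual
 * assert location in the n23 branch isn't shown in this ticket. */

/* before: tor_assert(conn);   crashes the process if conn is NULL */

/* after: bail out quietly so the run can continue while we figure out
 * why conn is NULL in the first place */
if (!conn)
  return;
```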
I confess that the code needs a lot of refactoring and checking. I had hoped to be able to do that (or better, get Nick or Andrea to do that) in parallel with performance tests.
> One of the configs is to use 'N3Initial 50'. I noticed that N3Min is 100 by default in config.c. Do we need to change N3Min if we set N3Initial below it?
No. N3Min is unused -- it's from the "adaptive N23" case, which we ripped out because it was so ugly.
I just pushed a new n23-4 branch. It should address the compile warning and the assert trigger. It's also rebased onto current master.
It remains the case that I don't know if the code "should" do the right thing, but it would still be nice to find out if it at least functions.
Karsten, if you're looking for more sims to run, this is a good one. You can run it with default config as a baseline (UseN23 defaults to 0). Then with
UseN23 1
as the main test.
And if that goes nicely,
UseN23 1
N3Initial 50
to see if things change.
Since there are still good odds that something will break, if it's just as easy to run it on a smaller VM image first, that might make sense. But, whatever is easy.
> Karsten, if you're looking for more sims to run, this is a good one. You can run it with default config as a baseline (UseN23 defaults to 0).
I just started this simulation with the default value for UseN23.
> Since there are still good odds that something will break, if it's just as easy to run it on a smaller VM image first, that might make sense. But, whatever is easy.
The large VM is easier. I ran into problems with the small and tiny VMs last week that I was unable to track down, but the large VM works just fine. Sticking with it for now.
> Karsten, if you're looking for more sims to run, this is a good one. You can run it with default config as a baseline (UseN23 defaults to 0).

> I just started this simulation with the default value for UseN23.
I'm running into a problem, but it seems that master is to blame here, not the n23-4 branch. There's something wrong with bootstrapping: none of the relays or clients manage to download an up-to-date consensus and relay descriptors, and only 76 of the 100 exits (and interestingly, none of the non-exits or clients) manage to get the very first consensus published at 00:05:00.

I stared at the scallion.log files for a bit, but couldn't see an obvious reason for the problem. I uploaded Shadow's data directories for the n23-4 and the master simulation (6.2M each). I could change the log level to info or debug if someone wants to look at those logs, and I could manually bisect Tor versions to see when Shadow started being unhappy.

Not sure if it's relevant, but I had to tweak src/library/scallion/scallion-tor.c to use idkey = client_identitykey instead of idkey = identitykey to make Shadow compile.
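For reference, the tweak is just swapping which identity-key symbol the plugin binds to (surrounding code omitted; only the changed assignment is taken from the description above):

```c
/* in src/library/scallion/scallion-tor.c */
/* idkey = identitykey; */      /* old assignment, no longer builds */
idkey = client_identitykey;     /* bind to the client identity key instead */
```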
The easiest way to get that is to just use Shadow master.
Switching to Shadow master seems to have fixed the problem. Started the simulation again. Will post results here once I have them, probably tomorrow morning. Thanks!
> Switching to Shadow master seems to have fixed the problem. Started the simulation again. Will post results here once I have them, probably tomorrow morning. Thanks!
Simulation of the n23-4 branch with default torrcs completed successfully, AFAICS. The results are here. I started the other two configurations and will post results once I have them.
The other two simulations finished far faster than expected, but it looks like there were massive problems with finishing downloads. I'm attaching the PDF output as usual. I also looked at overall memory usage to see if there was a drop indicating that processes died, but memory usage looks okay. I then plotted log messages indicating completed downloads, which shows that far fewer downloads were completed, especially in the third case. I didn't find any unusual warnings or errors in the logs.