Opened 8 years ago

Closed 17 months ago

#4486 closed task (invalid)

Research: should N23 actually help in practice?

Reported by: arma
Owned by: arma
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity: Normal
Keywords: performance, flowcontrol, nickm-cares
Cc: robgjansen, tschorsch@…, karsten, iang, malsabah@…, adi, metrics-team
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The N23 design (see http://freehaven.net/anonbib/#pets2011-defenestrator) looked promising based on the ExperimenTor simulations.

What do the Shadow simulations think about it?

What knobs should we expose to be able to tune it in practice? What are the recommended initial guesses for those knobs, and how sensitive are our expected performance results to changes in the values, or changes in network load?

Attachments (12)

n23-2012-03-15.pdf (101.2 KB) - added by robgjansen 7 years ago.
n23 vs vanilla Tor performance
n23-2012-03-16.pdf (122.5 KB) - added by robgjansen 7 years ago.
added a n23 run configured with "N3Initial 50"
20120810-ec2-n23-combined.pdf (426.9 KB) - added by robgjansen 7 years ago.
client performance, N23, probably broken
authority.log.xz (96.2 KB) - added by robgjansen 7 years ago.
logfile exposing bug in arma's n23-3 branch
n23-4-vanilla.pdf (370.0 KB) - added by karsten 7 years ago.
Simulation of N23 (arma/n23-4, commit fb5fdbf), vanilla only
n23-4-combined.pdf (428.3 KB) - added by karsten 7 years ago.
task4486-mem-2012-10-09.png (33.0 KB) - added by karsten 7 years ago.
task4486-dl-2012-10-09.png (57.0 KB) - added by karsten 7 years ago.
task4486-dl-2012-10-11.png (121.1 KB) - added by karsten 7 years ago.
webclient999.log.gz (25.8 KB) - added by karsten 7 years ago.
20121025-apollo-n23-combined.pdf (728.8 KB) - added by robgjansen 7 years ago.
20121113-apollo-n23-heavy-combined.pdf (478.6 KB) - added by robgjansen 7 years ago.

Change History (63)

comment:1 in reply to:  description Changed 8 years ago by arma

Replying to arma:

What do the Shadow simulations think about it?

Also, has enough changed in ExperimenTor over the past almost-year to warrant a few confirmation runs?

comment:2 Changed 8 years ago by arma

Keywords: performance, flowcontrol → performance flowcontrol

comment:3 Changed 8 years ago by robgjansen

Cc: jansen@… added

comment:4 Changed 8 years ago by Flo

Cc: tschorsch@… added

comment:5 Changed 8 years ago by karsten

Parent ID: #4506

comment:6 Changed 7 years ago by arma

Ok, the n23-2 branch in my git repo (git://git.torproject.org/arma/tor) is the patch we have to work with (see #4488 for how we got it). You can run it with "UseN23 1" to turn on the new behavior. If you leave UseN23 off, it should act like a normal Tor.

Changed 7 years ago by robgjansen

Attachment: n23-2012-03-15.pdf added

n23 vs vanilla Tor performance

comment:7 Changed 7 years ago by robgjansen

Cc: robgjansen added; jansen@… removed

I've attached a first set of results. The Tor model is as described in #4086 (where relay capacities in Shadow are based on their reported observed bandwidth in Tor).

I ran a vanilla experiment, and an n23 experiment with the following:

UseN23 1
CircuitWindowSize 1000000
StreamWindowSize 1000000

Completed download counts may give us a sense of load on the network.
vanilla:
24148 320KiB (web)
74 5MiB (bulk)
n23:
28767 320KiB (web)
237 5MiB (bulk)

I added the "window size hack" to my configuration, but ended up using the git version where that was presumably not needed. Would this cause any problems?

comment:8 in reply to:  7 ; Changed 7 years ago by arma

Replying to robgjansen:

Completed download counts may give us a sense of load on the network.
vanilla:
24148 320KiB (web)
74 5MiB (bulk)
n23:
28767 320KiB (web)
237 5MiB (bulk)

So in this case the web downloads were a little bit slower, and the bulk downloads were a lot faster. And just as important, more of the downloads (especially bulk downloads) failed outright in the vanilla case. And most important of all, the code actually works.

What do you count as a failed download? Does your downloader cut off the attempt after 50 or 80 seconds or something? Do we have any sense of whether these failed downloads got 0 bytes or 'most of the bytes' before failing?

Mashael said, before seeing your results, "If N3Initial is small (50-70), it can throttle bulk downloaders and improve performance for browsers." So that's the clear next experiment: run the n23 side with "N3Initial 50". This means each circuit will have fewer cells in flight between relays (more accurately, sitting in the outbuf) at any given time.
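To put "N3Initial 50" in concrete terms, here is a rough back-of-the-envelope sketch. It assumes fixed 512-byte cells and reads N3Initial as roughly the number of cells one circuit may have sitting in the outbuf at a hop. The helper name is made up for illustration and is not part of the n23 branch.

```python
CELL_SIZE_BYTES = 512  # assumption: fixed-size 512-byte Tor cells


def max_queued_bytes(n3_initial: int, cell_size: int = CELL_SIZE_BYTES) -> int:
    """Rough upper bound on bytes one circuit can have queued at a hop.

    Hypothetical helper: treats N3Initial as a per-circuit cell budget.
    """
    if n3_initial < 0:
        raise ValueError("n3_initial must be non-negative")
    return n3_initial * cell_size


# The range Mashael suggested (50-70 cells).
for n3 in (50, 70):
    print(f"N3Initial {n3}: at most {max_queued_bytes(n3)} bytes queued per circuit")
```

Under these assumptions a budget of 50 cells is about 25 KiB per circuit per hop, which is small compared to a 5MiB bulk download, so it is plausible that bulk transfers feel the limit first.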

(I also just noticed that ewma is off unless explicitly turned on. We should continue leaving it off for now, but we should keep in mind that at some point in the future we'll want to compare with 'on'.)

I added the "window size hack" to my configuration, but ended up using the git version where that was presumably not needed. Would this cause any problems?

It shouldn't.

Thanks!

comment:9 in reply to:  8 ; Changed 7 years ago by robgjansen

Replying to arma:

What do you count as a failed download? Does your downloader cut off the attempt after 50 or 80 seconds or something? Do we have any sense of whether these failed downloads got 0 bytes or 'most of the bytes' before failing?

The downloaders do not "give up" unless the socks connection is closed by the other end. Although, I now realize that it's probably smart to have the client "give up" so as to avoid knocking it out of the simulation because its download hung. Doesn't Tor kill the AP connection if it can't get a circuit by a deadline?

I do not explicitly count failed downloads, but I do track bytes received over time. So we could count partial downloads that never completed, and infer failed downloads as partials that were started before the last X minutes of the simulation (to avoid counting legitimate in-progress downloads as failed).
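A minimal sketch of that inference, assuming we can extract a start time, requested size, and bytes received for each download from the client logs. The record layout, function names, and example numbers are all hypothetical. They are not Shadow's actual log format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Download:
    start: float      # simulated time the download began, in seconds
    size: int         # bytes the client asked for
    received: int     # bytes actually received by the end of the run


def classify(d: Download, sim_end: float, grace: float) -> str:
    """Label one download as completed, failed, or still in progress."""
    if d.received >= d.size:
        return "completed"
    # A partial download that began before the final `grace` seconds had
    # time to finish, so treat it as failed rather than merely in progress.
    if d.start < sim_end - grace:
        return "failed"
    return "in_progress"


def summarize(downloads, sim_end, grace):
    """Count downloads per label."""
    return Counter(classify(d, sim_end, grace) for d in downloads)


# Made-up example data: a 60-minute run with a 10-minute grace window.
downloads = [
    Download(start=100.0, size=327680, received=327680),   # finished
    Download(start=200.0, size=5242880, received=247707),  # stuck partway
    Download(start=3500.0, size=327680, received=1024),    # started late
]
print(summarize(downloads, sim_end=3600.0, grace=600.0))
```

Keeping `received` in the record also lets us answer whether failed downloads got 0 bytes or most of the bytes, by looking at `received / size` for the failed ones.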

the clear next experiment: run the n23 side with "N3Initial 50".

It's running now :)

comment:10 in reply to:  9 ; Changed 7 years ago by arma

Replying to robgjansen:

The downloaders do not "give up" unless the socks connection is closed by the other end. Although, I now realize that it's probably smart to have the client "give up" so as to avoid knocking it out of the simulation because its download hung. Doesn't Tor kill the AP connection if it can't get a circuit by a deadline?

Yes. But if it gets a circuit and starts getting bytes, it will never stop until it gets an 'end' cell from the other side. From the Tor client's perspective, it's got a TCP connection open and the other side has chosen not to send data lately, which is fine.

So if a significant fraction of the clients are stalling, that's either a Tor bug (e.g. "for some reason the end cell gets lost") or a simulation bug (no example guesses yet there). But I guess if the client is no longer adding load to the network, it's not necessarily the case that lots of the clients are stalling -- it could be a few of them right at the beginning, and then they just disappear from the simulation after that. That case would especially influence the rest of the simulation since the client load has changed.

comment:11 in reply to:  10 Changed 7 years ago by arma

Replying to arma:

So if a significant fraction of the clients are stalling, that's either a Tor bug (e.g. "for some reason the end cell gets lost") or a simulation bug (no example guesses yet there). But I guess if the client is no longer adding load to the network, it's not necessarily the case that lots of the clients are stalling -- it could be a few of them right at the beginning, and then they just disappear from the simulation after that. That case would especially influence the rest of the simulation since the client load has changed.

I opened #5397 to focus on this question (since it came up in #5336 too).

Changed 7 years ago by robgjansen

Attachment: n23-2012-03-16.pdf added

added a n23 run configured with "N3Initial 50"

comment:12 in reply to:  9 Changed 7 years ago by robgjansen

Replying to robgjansen:

the clear next experiment: run the n23 side with "N3Initial 50".

It's running now :)

I just attached the results.

Download counts:
n23-n3initial50: 25841 320KiB (web), 64 5MiB (bulk)

It looks like there is some throttling going on, since TTFB improves for both bulk and web. While bulk downloaders seem to take the majority of the punishment, unfortunately a big chunk of web downloaders take some as well.

comment:13 Changed 7 years ago by robgjansen

I'm running an updated set of experiments on EC2 now.

Changed 7 years ago by robgjansen

Attachment: 20120810-ec2-n23-combined.pdf added

client performance, N23, probably broken

comment:14 Changed 7 years ago by robgjansen

I just uploaded the client performance graphs. You can find the Tor network and client model described in #6401.

There are 3 simulations here, all done with tor-0.2.3.16-alpha. In the N23 sims, I merged 0.2.3.16-alpha into arma's n23-2 branch.

Load distribution in vanilla Tor:

TYPE	#XFERS	GiB	%
im	34735	0.033	0.075
web	85779	26.178	59.376
bulk	1586	7.744	17.565
p2p	596397	9.100	20.641
perf50k	1896	0.090	0.205
perf1m	965	0.942	2.138
TOTAL	721358	44.088	100.000

Load distribution with CircuitPriorityHalflife 0, UseN23 1:

TYPE	#XFERS	GiB	%
im	21581	0.021	0.050
web	77185	23.555	57.493
bulk	3323	16.226	39.603
p2p	42	0.001	0.002
perf50k	1686	0.080	0.196
perf1m	1114	1.088	2.655
TOTAL	104931	40.970	100.000

Load distribution with CircuitPriorityHalflife 30, UseN23 1:

TYPE	#XFERS	GiB	%
im	26429	0.025	0.064
web	73069	22.299	56.764
bulk	3265	15.942	40.583
p2p	8	0.000	0.000
perf50k	1672	0.080	0.203
perf1m	960	0.938	2.386
TOTAL	105403	39.284	100.000

We are again seeing problems with the clients, this time with the P2P clients. We should really focus on getting things right in #6341 before running more experiments here, since that patch is much simpler than this.
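To make the client problem easier to see, here is a small sketch that compares the #XFERS columns of the vanilla table and the first N23 table (CircuitPriorityHalflife 0). The numbers are copied from the tables above. The variable names are just for illustration.

```python
# Completed transfer counts per client type, copied from the tables above.
vanilla = {"im": 34735, "web": 85779, "bulk": 1586,
           "p2p": 596397, "perf50k": 1896, "perf1m": 965}
n23 = {"im": 21581, "web": 77185, "bulk": 3323,
       "p2p": 42, "perf50k": 1686, "perf1m": 1114}

# Sanity check against the TOTAL rows.
assert sum(vanilla.values()) == 721358
assert sum(n23.values()) == 104931

# Ratio of N23 transfers to vanilla transfers for each client type.
ratios = {kind: n23[kind] / vanilla[kind] for kind in vanilla}

for kind, ratio in ratios.items():
    print(f"{kind:8s} {vanilla[kind]:7d} -> {n23[kind]:6d}  (x{ratio:.4f})")
```

The p2p row collapses to a tiny fraction of its vanilla count while bulk roughly doubles, which is why the totals are not comparable between the runs.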

We should definitely re-run this, but I'm posting the current results anyway in case they are useful.

comment:15 in reply to:  14 Changed 7 years ago by robgjansen

Replying to robgjansen:

There are 3 simulations here, all done with tor-0.2.3.16-alpha. In the N23 sims, I merged 0.2.3.16-alpha into arma's n23-2 branch.

BTW, in addition to the above experiments, I also ran two with N23Initial 50. Both of them segfaulted.

comment:16 Changed 7 years ago by nickm

If it would be useful to have the N23Initial 50 tests work out, and you can get me stack traces, I can try to figure out what the issue is if you want.

comment:17 Changed 7 years ago by arma

My n23-3 branch is the latest we've got here.

Rob, now that the latest Shadow bug is solved, shall we give this one a go again?

comment:18 in reply to:  17 Changed 7 years ago by robgjansen

Replying to arma:

My n23-3 branch is the latest we've got here.

Rob, now that the latest Shadow bug is solved, shall we give this one a go again?

Sounds good. Should we continue with the model from #6401, or should we use a simpler version with only web and bulk clients?

comment:19 Changed 7 years ago by arma

Whichever you prefer. Slight preference for the simpler model, since the analysis might end up simpler.

comment:20 in reply to:  19 Changed 7 years ago by robgjansen

Replying to arma:

Whichever you prefer. Slight preference for the simpler model, since the analysis might end up simpler.

OK. I'll start some simulations using the large-m2.4xlarge model (distributed with Shadow) soon.

comment:21 Changed 7 years ago by robgjansen

Owner: set to robgjansen
Status: new → assigned

I'm changing my mind and using the #6401 model. The reason is described here.

Spinning up simulations now.

Changed 7 years ago by robgjansen

Attachment: authority.log.xz added

logfile exposing bug in arma's n23-3 branch

comment:22 Changed 7 years ago by robgjansen

Replying to robgjansen:

I'm changing my mind and using the #6401 model. The reason is described here.

Spinning up simulations now.

Looks like we may have a bug here (using arma's n23-3 branch).

0:12:1:060570 [thread-0] 0:15:0:361828572 [scallion-error] [4uthority-201.2.0.0] [intercept_logv] [tor-err] BUG: connection_or_send_flowcontrol() src/or/connection_or.c:1907: connection_or_send_flowcontrol: Assertion conn failed; aborting.

I noticed this warning during the build:

src/or/connection_or.c: In function ‘connection_or_send_destroy’:
src/or/connection_or.c:1889:14: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]

I've attached the info-level logfile from the node that triggered the bug.

comment:23 Changed 7 years ago by robgjansen

Status: assigned → needs_information

comment:24 Changed 7 years ago by robgjansen

One of the configs is to use 'N3Initial 50'. I noticed that N3Min is 100 by default in config.c. Do we need to change N3Min if we set N3Initial below it?

comment:25 Changed 7 years ago by arma

Exciting. If you change the assert to an if (!conn) return, does it get farther?

I confess that the code needs a lot of refactoring and checking. I had hoped to be able to do that (or better, get Nick or Andrea to do that) in parallel with performance tests.

comment:26 in reply to:  24 Changed 7 years ago by arma

Replying to robgjansen:

One of the configs is to use 'N3Initial 50'. I noticed that N3Min is 100 by default in config.c. Do we need to change N3Min if we set N3Initial below it?

No. N3Min is unused -- it's from the "adaptive N23" case, which we ripped out because it was so ugly.

comment:27 Changed 7 years ago by arma

Cc: karsten iang added

I just pushed a new n23-4 branch. It should address the compile warning and the assert trigger. It's also rebased onto current master.

It remains the case that I don't know if the code "should" do the right thing, but it would still be nice to find out if it at least functions.

Karsten, if you're looking for more sims to run, this is a good one. You can run it with default config as a baseline (UseN23 defaults to 0). Then with

UseN23 1

as the main test.

And if that goes nicely,

UseN23 1
N3Initial 50

to see if things change.
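For anyone scripting the three runs, a minimal sketch that writes out the extra torrc lines for each configuration. Only the UseN23 and N3Initial options come from this ticket. The run labels and the helper function are made up, and how the fragment gets merged into each node's torrc depends on your Shadow setup.

```python
# Extra torrc options for each run; an empty dict is the baseline,
# since UseN23 defaults to 0.
CONFIGS = {
    "vanilla": {},
    "n23": {"UseN23": 1},
    "n23-n3initial50": {"UseN23": 1, "N3Initial": 50},
}


def torrc_fragment(options: dict) -> str:
    """Render options as 'Name value' lines, one per option."""
    return "".join(f"{name} {value}\n" for name, value in options.items())


for label, options in CONFIGS.items():
    print(f"# {label}")
    print(torrc_fragment(options), end="")
```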

Since there are still good odds that something will break, if it's just as easy to run it on a smaller VM image first, that might make sense. But, whatever is easy.

comment:28 in reply to:  27 ; Changed 7 years ago by karsten

Replying to arma:

Karsten, if you're looking for more sims to run, this is a good one. You can run it with default config as a baseline (UseN23 defaults to 0).

I just started this simulation with the default value for UseN23.

Since there are still good odds that something will break, if it's just as easy to run it on a smaller VM image first, that might make sense. But, whatever is easy.

The large VM is easier. I ran into problems with the small and tiny VMs last week which I was unable to track down. The large VM works just fine. Sticking with the large VM for now.

comment:29 in reply to:  28 ; Changed 7 years ago by karsten

Replying to karsten:

Replying to arma:

Karsten, if you're looking for more sims to run, this is a good one. You can run it with default config as a baseline (UseN23 defaults to 0).

I just started this simulation with the default value for UseN23.

I'm running into a problem, but it seems that master is to blame here, not the n23-4 branch. There's something wrong with bootstrapping. None of the relays or clients manage to download an up-to-date consensus and relay descriptors, and only 76 of the 100 exits (and interestingly, none of the non-exits or clients) manage to get the very first consensus published at 00:05:00. I stared at the scallion.log files for a bit, but couldn't see an obvious reason for the problem. I uploaded Shadow's data directories for the n23-4 and the master simulation (6.2M each). I could change the log level to info or debug if someone wants to look at those logs, and I could manually bisect Tor versions to see when Shadow started being unhappy. Not sure if it's relevant, but I had to tweak src/library/scallion/scallion-tor.c to use idkey = client_identitykey instead of idkey = identitykey to make Shadow compile.

comment:30 in reply to:  29 ; Changed 7 years ago by robgjansen

Replying to karsten:

Replying to karsten:

Replying to arma:

Karsten, if you're looking for more sims to run, this is a good one. You can run it with default config as a baseline (UseN23 defaults to 0).

I just started this simulation with the default value for UseN23.

I'm running into a problem, but it seems that master is to blame here, not the n23-4 branch. There's something wrong with bootstrapping.

If you're using Tor 0.2.4.x, you'll need the following Shadow fix, which didn't make it into the default release 1.6.0 that's part of the current EC2 image:
https://github.com/shadow/shadow/commit/1719e7e287743767f7d10895f15683a8f90bb865

The easiest way to get that is to just use Shadow master.

comment:31 in reply to:  30 ; Changed 7 years ago by karsten

Replying to robgjansen:

If you're using Tor 0.2.4.x, you'll need the following Shadow fix, which didn't make it into the default release 1.6.0 that's part of the current EC2 image:
https://github.com/shadow/shadow/commit/1719e7e287743767f7d10895f15683a8f90bb865

The easiest way to get that is to just use Shadow master.

Switching to Shadow master seems to have fixed the problem. Started the simulation again. Will post results here once I have them, probably tomorrow morning. Thanks!

Changed 7 years ago by karsten

Attachment: n23-4-vanilla.pdf added

Simulation of N23 (arma/n23-4, commit fb5fdbf), vanilla only

comment:32 in reply to:  31 Changed 7 years ago by karsten

Replying to karsten:

Switching to Shadow master seems to have fixed the problem. Started the simulation again. Will post results here once I have them, probably tomorrow morning. Thanks!

Simulation of the n23-4 branch with default torrcs completed successfully, AFAICS. The results are here. I started the other two configurations and will post results once I have them.

comment:33 Changed 7 years ago by arma

So far so good (that pdf looks plausible -- and it was supposed to be the control, so it should).

Changed 7 years ago by karsten

Attachment: n23-4-combined.pdf added

Changed 7 years ago by karsten

Attachment: task4486-mem-2012-10-09.png added

Changed 7 years ago by karsten

Attachment: task4486-dl-2012-10-09.png added

comment:34 Changed 7 years ago by karsten

The other two simulations finished far faster than expected, but it looks like there were massive problems with finishing downloads. I'm attaching the PDF output as usual. I also looked at overall memory usage to see if there was a drop indicating that processes died, but memory usage looks okay. Then I plotted log messages indicating completed downloads which shows that far fewer downloads were completed, especially in the third case. I didn't find any unusual warnings or errors in the logs.

Changed 7 years ago by karsten

Attachment: task4486-dl-2012-10-11.png added

Changed 7 years ago by karsten

Attachment: webclient999.log.gz added

comment:35 Changed 7 years ago by karsten

I'm thinking that the n23-4 branch has a problem with downloads exceeding a few hundred kilobytes. Some of these downloads neither finish nor time out. The larger the file to be downloaded, the more likely it is to get stuck this way. See this graph showing completed downloads by client type. Also see this info-level log of a webclient. Note how there's no progress after receiving 247707 bytes of download 2 (timestamp 0:53:34:295277).
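A rough sketch of how such stalls could be flagged automatically, assuming the client log can be reduced to (time in seconds, total bytes received) samples per download. The sample values below are invented apart from the 247707-byte figure, and the time 3214.3 assumes the timestamp above reads as hours:minutes:seconds. Function names are hypothetical.

```python
def last_progress(samples):
    """Return the (time, total_bytes) sample at which the byte count last grew."""
    last = None
    seen = 0
    for t, total in samples:
        if total > seen:
            last = (t, total)
            seen = total
    return last


def is_stalled(samples, expected_size, sim_end, threshold):
    """True if the download is incomplete and has been idle for more than
    `threshold` seconds at the end of the simulation."""
    last_time, total = last_progress(samples) or (0.0, 0)
    return total < expected_size and sim_end - last_time > threshold


# Invented samples for a 320KiB download that stops growing at 247707 bytes.
samples = [(3200.0, 100000), (3214.3, 247707), (3300.0, 247707), (3500.0, 247707)]
print(last_progress(samples))
print(is_stalled(samples, expected_size=327680, sim_end=3600.0, threshold=120.0))
```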

comment:36 Changed 7 years ago by robgjansen

I was talking to John from UMN last week. He's been using a modified version of the n23-2 branch, after fixing some problems that prevented his torrent clients from working properly. IIRC, n23 only worked when downloading data (not uploading). He mentioned that the n23-3 was broken.

I'm currently trying to extract some code from him that actually works. Longer term, I think a re-design is in order.

Changed 7 years ago by robgjansen

Attachment: 20121025-apollo-n23-combined.pdf added

comment:37 Changed 7 years ago by robgjansen

I just uploaded a new set of graphs, created using arma's n23-2 branch (after fixing the missing bracket). These were done with the large-m2.4xlarge distributed with Shadow.

The graphs generally show improvement for small files only when using 'N3Initial 50', as is evident in all of the time-to-first-byte graphs. However, N23 seems to make things worse as more is downloaded through the circuit. The only time-to-last-byte improvement is for the 50KiB perf clients (who use a new circuit for every download), and again, only when using 'N3Initial 50'.

comment:38 in reply to:  36 ; Changed 7 years ago by arma

Replying to robgjansen:

I was talking to John from UMN last week. He's been using a modified version of the n23-2 branch, after fixing some problems that prevented his torrent clients from working properly. IIRC, n23 only worked when downloading data (not uploading). He mentioned that the n23-3 was broken.

I'm currently trying to extract some code from him that actually works.

I've pushed an update to my n23-4 branch that has John's patch in it.

comment:39 in reply to:  37 Changed 7 years ago by arma

Replying to robgjansen:

I just uploaded a new set of graphs, created using arma's n23-2 branch (after fixing the missing bracket). These were done with the large-m2.4xlarge distributed with Shadow.

The graphs generally show improvement for small files only when using 'N3Initial 50', as is evident in all of the time-to-first-byte graphs. However, N23 seems to make things worse as more is downloaded through the circuit. The only time-to-last-byte improvement is for the 50KiB perf clients (who use a new circuit for every download), and again, only when using 'N3Initial 50'.

Does n23 look better when you load down the network a lot more, or when you make some client first links crappy (#4487)?

comment:40 in reply to:  38 Changed 7 years ago by robgjansen

Replying to arma:

Replying to robgjansen:

I was talking to John from UMN last week. He's been using a modified version of the n23-2 branch, after fixing some problems that prevented his torrent clients from working properly. IIRC, n23 only worked when downloading data (not uploading). He mentioned that the n23-3 was broken.

I'm currently trying to extract some code from him that actually works.

I've pushed an update to my n23-4 branch that has John's patch in it.

I've tested the n23-4 branch, and it's broken :( Vanilla Tor works as expected, but clients are plagued with socks connection errors when n23 is turned on, so no one ever downloads anything. I guess that means we should stick with n23-2 going forward, until a new n23 patch is written in #5379.

Changed 7 years ago by robgjansen

Attachment: 20121113-apollo-n23-heavy-combined.pdf added

comment:41 Changed 7 years ago by robgjansen

Status: needs_information → new

Replying to arma:

Does n23 look better when you load down the network a lot more,

The 'heavy load' graphs are here. There were 1.5 times the number of clients as in the experiments that produced the previous 'normal load' set of graphs, though I noticed that the overall load was nowhere near 1.5 times. It again seems like N3Initial50 is better for first byte, as well as last byte of 50K downloads, but download times again worsen for 1M and 5M files.

I'm skeptical of the results given our low confidence that the N23 code actually does what we think it should do.

or when you make some client first links crappy (#4487)?

Running that now (despite my comment about code confidence), and will report results in #4487.

comment:42 Changed 7 years ago by Mashael

Cc: malsabah@… added

comment:43 Changed 6 years ago by Mashael

Your n23-4 branch had changes to the original code (mainly in command.c). I repaired them in my malsabah/n23-4 branch on git-crysp (Roger can pull from there). Now it compiles and the clients can download files.

comment:44 Changed 6 years ago by arma

Cc: adi added

comment:45 Changed 6 years ago by arma

Ok, I pushed this to the n23-5 branch in my arma repository.

It's git commit 76353ae65 for those trying to follow along at home.

comment:46 Changed 6 years ago by arma

The next step is probably to try to rebase it from 0.2.4.3-alpha to maint-0.2.4, which apparently has non-trivial conflicts because of the channel stuff.

comment:47 Changed 6 years ago by robgjansen

Owner: changed from robgjansen to arma
Status: new → assigned

comment:48 Changed 2 years ago by nickm

Keywords: nickm-cares added

comment:49 Changed 21 months ago by teor

Severity: Normal

Set all open tickets without a severity to "Normal"

comment:50 Changed 17 months ago by irl

Cc: metrics-team added

Adding metrics-team to cc

comment:51 Changed 17 months ago by arma

Parent ID: #4506
Resolution: invalid
Status: assigned → closed

Yeah, ok. I'm going to close this one, mainly because #7346 is the main blocker for n23 analysis at this point, and "does the old code implement the design that we realized was wrong" is not the most pressing question anymore.
