Opened 8 years ago

Closed 7 years ago

#5336 closed task (implemented)

Do simulations of initial proposal 182 patch

Reported by: arma
Owned by:
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity:
Keywords:
Cc: robgjansen, kevin, tschorsch@…
Actual Points:
Parent ID: #4682
Points:
Reviewer:
Sponsor:

Description

We have a proposed patch in #4712 for proposal 182 (parent ticket #4682). The patch is missing some pieces (for example, it appears to do the wrong thing when RelayBandwidthRate is set), but I think that under constrained circumstances a simulation should still give us some intuition about whether the patch is on the right track, assuming the issues in #5334 are non-issues.

Child Tickets

Attachments (7)

task5336-2012-03-15.pdf (114.5 KB) - added by robgjansen 8 years ago.
shadow simulation results for task 5336
20120808-ec2-creditbuckets-combined.pdf (456.2 KB) - added by robgjansen 7 years ago.
client performance, credit caps
task5336-bwrate-2012-10-03.png (48.3 KB) - added by karsten 7 years ago.
task5336-bwburst-2012-10-03.png (55.1 KB) - added by karsten 7 years ago.
task5336-bwburst-abs-2012-10-05.png (37.5 KB) - added by karsten 7 years ago.
task5336-mem-2012-10-05.png (36.2 KB) - added by karsten 7 years ago.
task5336-combined.pdf (817.9 KB) - added by karsten 7 years ago.


Change History (33)

comment:1 Changed 8 years ago by arma

The "task5336a" branch in my git repo (git://git.torproject.org/~arma/git/tor) is vanilla master as of today.

The "task5336b" branch is the credit bucket patch, using the huge 10MB credit cap.

And the "task5336c" branch uses a more conservative credit cap (making it equal to our bandwidthburst).

It would be great to see comparisons between these three.

comment:2 Changed 8 years ago by arma

While I'm at it: Rob/Kevin, when you set the BandwidthRate and BandwidthBurst for your simulated relays, do you pick the smallest number out of the descriptor and set both rate and burst to that number? Or do you pull out both the Rate and the Burst and use them?

I imagine a simulated network that never has any extra space in its token buckets could behave quite differently from the real Tor network (where the fast relays often have significant cushion).

comment:3 Changed 8 years ago by Flo

Cc: robgjansen kevin tschorsch@… added; robgjansen kevin removed

comment:4 Changed 8 years ago by kevin

With ExperimenTor, I pull out both the BandwidthRate and BandwidthBurst options when sampling routers from a live Tor network configuration.

comment:5 in reply to:  4 Changed 8 years ago by robgjansen

Replying to kevin:

With ExperimenTor, I pull out both the BandwidthRate and BandwidthBurst options when sampling routers from a live Tor network configuration.

Shadow also uses both the BandwidthRate and BandwidthBurst from the server descriptors.
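
(Neither tool's parsing code appears in this ticket, so purely as an illustration: a minimal sketch of pulling both values out of a server descriptor, assuming the standard "bandwidth <rate> <burst> <observed>" line, all in bytes per second. The function name and example values below are made up.)

```python
# Hedged sketch: extract BandwidthRate and BandwidthBurst from a server
# descriptor's "bandwidth" line ("bandwidth <rate> <burst> <observed>",
# all in bytes per second). Not code from ExperimenTor or Shadow.

def parse_bandwidth_line(descriptor_text):
    """Return (rate, burst, observed) in bytes/s, or None if no bandwidth line."""
    for line in descriptor_text.splitlines():
        if line.startswith("bandwidth "):
            _, rate, burst, observed = line.split()[:4]
            return int(rate), int(burst), int(observed)
    return None

# Example: a relay advertising 5 MB/s rate, 10 MB/s burst, 3 MB/s observed.
rate, burst, observed = parse_bandwidth_line("bandwidth 5000000 10000000 3000000\n")
```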

comment:6 in reply to:  1 Changed 8 years ago by arma

Replying to arma:

The "task5336a" branch in my git repo (git://git.torproject.org/~arma/git/tor) is vanilla master as of today.

Kevin and I realized yesterday that the correct url is "git://git.torproject.org/arma/tor". The one I first mentioned won't work.

comment:7 Changed 8 years ago by arma

I just merged master into the task5336{a,b,c} branches and pushed new versions of them. That way they include the fix for #5373.

Changed 8 years ago by robgjansen

Attachment: task5336-2012-03-15.pdf added

shadow simulation results for task 5336

comment:8 Changed 8 years ago by robgjansen

I've attached a first set of results. The Tor model is as described in #4086 (where relay capacities in Shadow are based on their reported observed bandwidth in Tor).

Each of the task{a,b,c} branches was run directly, adding only the configs needed for my private test network.

Completed download counts may give us a sense of load on the network.
taska: 9482 320KiB (web), 43 5MiB (bulk)
taskb: 27635 320KiB (web), 188 5MiB (bulk)
taskc: 20076 320KiB (web), 201 5MiB (bulk)

Is there a reason that the taska counts should be so low (I usually expect somewhere in the 20k range for web download counts)? Did something change in a recent version of Tor? Or should I look closer at the logs and rerun taska?

comment:9 in reply to:  8 Changed 8 years ago by arma

Replying to robgjansen:

Is there a reason that the taska counts should be so low (I usually expect somewhere in the 20k range for web download counts)? Did something change in a recent version of Tor? Or should I look closer at the logs and rerun taska?

I opened #5397 to focus on this question (since it came up in #4486 too).

comment:10 Changed 7 years ago by arma

Rob: On #5397 you mention you aren't experiencing the issue anymore. Does that mean you have useful graphs for this ticket now? :)

comment:11 in reply to:  10 Changed 7 years ago by robgjansen

Replying to arma:

Rob: On #5397 you mention you aren't experiencing the issue anymore. Does that mean you have useful graphs for this ticket now? :)

I will when #6401 is no longer blocking simulation work.

comment:12 Changed 7 years ago by robgjansen

Shadow simulations are now running on EC2.

Changed 7 years ago by robgjansen

Attachment: 20120808-ec2-creditbuckets-combined.pdf added

client performance, credit caps

comment:13 Changed 7 years ago by robgjansen

I just uploaded a graph of client performance. See #6401 for a description of the Tor network model and the client model.

There are 3 experiments here, all run with tor-0.2.3.16-alpha:

Load distribution for vanilla Tor:

TYPE      #XFERS      GiB        %
im         34735    0.033    0.075
web        85779   26.178   59.376
bulk        1586    7.744   17.565
p2p       596397    9.100   20.641
perf50k     1896    0.090    0.205
perf1m       965    0.942    2.138
TOTAL     721358   44.088  100.000

Load distribution for 10 MiB credit cap:

TYPE      #XFERS      GiB        %
im         21724    0.021    0.058
web        55965   17.079   47.710
bulk        2530   12.354   34.509
p2p       347047    5.296   14.793
perf50k     1517    0.072    0.202
perf1m      1000    0.977    2.728
TOTAL     429783   35.798  100.000

Load distribution for bandwidthrate credit cap:

TYPE      #XFERS      GiB        %
im          7217    0.007    0.031
web        29426    8.980   40.783
bulk        2379   11.616   52.755
p2p        48498    0.740    3.361
perf50k      841    0.040    0.182
perf1m       651    0.636    2.887
TOTAL      89012   22.019  100.000

It looks like the credit caps are reducing overall network load, mostly from the web clients. Bulk load seems to be increasing. The effect seems greater with smaller credit caps.

comment:14 Changed 7 years ago by nickm

Are we able to say anything about the patch's effects on latency, memory usage, and whether nodes actually obey their bandwidth limits with the patch in place?

comment:15 Changed 7 years ago by arma

I have the same question as I had for #6341: both credit cap cases get their last byte faster than vanilla, but they end up doing fewer transfers. What's up with that?

comment:16 in reply to:  15 Changed 7 years ago by robgjansen

Replying to arma:

I have the same question as I had for #6341: both credit cap cases get their last byte faster than vanilla, but they end up doing fewer transfers. What's up with that?

I don't know enough about what the credit cap is supposed to be doing here to answer this. Can you give any intuition as to whether you would expect this to happen, given the desired functionality? And/or can you briefly explain what the patch does?

Also, note that a separate vanilla run was done in #6401 where the load mostly agrees with the vanilla run here. So is it reasonable to say the patch is causing the behavior?

comment:17 in reply to:  14 Changed 7 years ago by robgjansen

Replying to nickm:

Are we able to say anything about the patch's effects on latency, memory usage, and whether nodes actually obey their bandwidth limits with the patch in place?

For each Tor node we can track CPU utilization, memory, and input/output bytes (though I may have to clean up some loose ends in this Shadow ticket). I believe this will allow us to address your concerns, but I am not sure what you mean by latency.

I'd have to do additional experiments with this feature turned on for the relays. Is it safe to assume this is desired?

comment:18 Changed 7 years ago by robgjansen

  1. Print the heartbeat message every second instead of every minute with $ scallion --heartbeat-frequency=1 …
  2. The heartbeat message will contain the number of bytes each node sends and receives per second. Match that up with the relay bandwidth limits to determine if nodes are actually obeying their bandwidth limits. You probably have to either modify the parse() function in analyze.py or write a new script for this.

The per-node memory tracking is not working yet in Shadow, so we'll only be able to say things about overall memory consumption by looking at the data/dstat.log file.
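
(For illustration only, a rough sketch of steps 1-2 above. The actual heartbeat line format isn't shown in this ticket, so the regex below assumes a hypothetical line like "[nodename] heartbeat: sent=12345 recv=67890", one per node per second; adjust it to whatever scallion.log actually contains.)

```python
# Hedged sketch of checking per-second heartbeat byte counts against each
# relay's configured limits. The log-line format below is an assumption.
import re
from collections import defaultdict

HEARTBEAT_RE = re.compile(r"\[(?P<node>\S+)\] heartbeat: sent=(?P<sent>\d+) recv=(?P<recv>\d+)")

def per_second_sent_bytes(log_path):
    """Collect, per node, the list of bytes sent in each 1-second interval."""
    sent = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            m = HEARTBEAT_RE.search(line)
            if m:
                sent[m.group("node")].append(int(m.group("sent")))
    return sent

def report_limit_violations(sent, limits):
    """limits: node -> (BandwidthRate, BandwidthBurst), both in bytes/s."""
    for node, samples in sent.items():
        rate, burst = limits[node]
        over_burst = sum(1 for s in samples if s > burst)
        print(f"{node}: {over_burst}/{len(samples)} seconds above BandwidthBurst ({burst} B/s)")
```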

Changed 7 years ago by karsten

Attachment: task5336-bwrate-2012-10-03.png added

Changed 7 years ago by karsten

Attachment: task5336-bwburst-2012-10-03.png added

comment:19 Changed 7 years ago by karsten

Status: new → needs_review

Replying to robgjansen:

  1. Print the heartbeat message every second instead of every minute with $ scallion --heartbeat-frequency=1 …
  2. The heartbeat message will contain the number of bytes each node sends and receives per second. Match that up with the relay bandwidth limits to determine if nodes are actually obeying their bandwidth limits. You probably have to either modify the parse() function in analyze.py or write a new script for this.

Done. I wrote my own script and made two graphs: the first graph compares bandwidth rates to median bandwidths, and the second graph compares bandwidth bursts to 99th percentiles. For me it looks like all three branches respect bandwidth rates quite well and do not respect bandwidth bursts as much as they should. I do not see major differences between the three branches. I wonder if there's a better way to visualize this.
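
(A minimal sketch of that comparison, not karsten's actual script; the per-node per-second byte counts and configured limits are assumed inputs.)

```python
# Compare per-relay traffic statistics to the configured limits:
# median of per-second bytes vs. BandwidthRate, 99th percentile vs. BandwidthBurst.
import numpy as np

def compare_to_limits(samples_by_node, limits):
    """samples_by_node: node -> list of bytes sent per second.
    limits: node -> (rate, burst) in bytes/s.
    Returns node -> (median - rate, p99 - burst); positive values mean the
    relay exceeded that limit at that statistic."""
    result = {}
    for node, samples in samples_by_node.items():
        rate, burst = limits[node]
        result[node] = (np.median(samples) - rate,
                        np.percentile(samples, 99) - burst)
    return result
```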

The per-node memory tracking is not working yet in Shadow, so we'll only be able to say things about overall memory consumption by looking at the data/dstat.log file.

I have the three dstat.log files. What do I do with them?

comment:20 in reply to:  19 Changed 7 years ago by robgjansen

Replying to karsten:

Replying to robgjansen:

  1. Print the heartbeat message every second instead of every minute with $ scallion --heartbeat-frequency=1 …
  2. The heartbeat message will contain the number of bytes each node sends and receives per second. Match that up with the relay bandwidth limits to determine if nodes are actually obeying their bandwidth limits. You probably have to either modify the parse() function in analyze.py or write a new script for this.

Done. I wrote my own script and made two graphs: the first graph compares bandwidth rates to median bandwidths, and the second graph compares bandwidth bursts to 99th percentiles. For me it looks like all three branches respect bandwidth rates quite well and do not respect bandwidth bursts as much as they should. I do not see major differences between the three branches. I wonder if there's a better way to visualize this.

It may make sense that the amount sent on the wire is slightly more than the 99th percentile of bandwidth sent in Tor (because control packets, packet header overhead, etc. are included in the amount sent on the wire but not in Tor's limits).

The per-node memory tracking is not working yet in Shadow, so we'll only be able to say things about overall memory consumption by looking at the data/dstat.log file.

I have the three dstat.log files. What do I do with them?

I believe the first few lines contain header info that explains the format of the CSV. One of the columns has a timestamp and another has the system memory usage. You should be able to draw a memory-over-time plot with those two columns and compare the branches in the same graph. (Note that this is total system memory usage, so this only works if nothing else is consuming memory on these machines - which should be the case if you used EC2.)
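
(A rough sketch of that memory-over-time comparison. dstat's CSV output starts with a few header lines, and which columns hold the epoch timestamp and used memory depends on how dstat was invoked, so the column indices, units, and file paths below are assumptions to adjust after inspecting the header.)

```python
# Hedged sketch: plot total system memory over time for the three branches
# from their dstat.log files. Column indices and paths are assumptions.
import csv
import matplotlib.pyplot as plt

TIME_COL = 0  # assumed: epoch timestamp column
MEM_COL = 1   # assumed: "used" memory column, in bytes

def load_memory_series(path):
    times, mem_mib = [], []
    with open(path) as f:
        for row in csv.reader(f):
            try:
                t, m = float(row[TIME_COL]), float(row[MEM_COL])
            except (ValueError, IndexError):
                continue  # skip dstat's header/comment rows
            times.append(t)
            mem_mib.append(m / 2**20)
    return times, mem_mib

for branch in ("task5336a", "task5336b", "task5336c"):
    t, m = load_memory_series(f"{branch}/data/dstat.log")
    plt.plot([x - t[0] for x in t], m, label=branch)
plt.xlabel("seconds since start")
plt.ylabel("used system memory (MiB)")
plt.legend()
plt.savefig("task5336-mem.png")
```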

comment:21 in reply to:  19 Changed 7 years ago by robgjansen

Replying to karsten:

Replying to robgjansen:

  1. Print the heartbeat message every second instead of every minute with $ scallion --heartbeat-frequency=1 …
  2. The heartbeat message will contain the number of bytes each node sends and receives per second. Match that up with the relay bandwidth limits to determine if nodes are actually obeying their bandwidth limits. You probably have to either modify the parse() function in analyze.py or write a new script for this.

Done.

Also, can you attach the performance graphs for this set of runs?

Changed 7 years ago by karsten

Attachment: task5336-bwburst-abs-2012-10-05.png added

Changed 7 years ago by karsten

Attachment: task5336-mem-2012-10-05.png added

comment:22 Changed 7 years ago by karsten

Replying to robgjansen:

It may make sense that the amount sent on the wire is slightly more than the 99th percentile of bandwidth sent in Tor (because control packets, packet header overhead, etc. are included in the amount sent on the wire but not in Tor's limits).

Makes sense. I attached another graph that shows cumulative fractions of the differences between 99th percentile and bandwidth burst. That graph shows that there's hardly any difference between the three branches.
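
(Sketch of that cumulative-fraction plot, not the script behind the attached graph; diffs_by_branch is a hypothetical placeholder mapping each branch to its per-relay list of "99th percentile minus BandwidthBurst" values.)

```python
# Hedged sketch: CDF of (99th-percentile bytes/s - BandwidthBurst) per relay,
# one curve per branch. diffs_by_branch is a placeholder to be filled in.
import numpy as np
import matplotlib.pyplot as plt

diffs_by_branch = {}  # branch name -> list of per-relay differences (bytes/s)

for branch, diffs in diffs_by_branch.items():
    xs = np.sort(diffs)
    ys = np.arange(1, len(xs) + 1) / len(xs)  # cumulative fraction of relays
    plt.step(xs, ys, where="post", label=branch)

plt.xlabel("99th percentile minus BandwidthBurst (bytes/s)")
plt.ylabel("cumulative fraction of relays")
plt.legend()
plt.savefig("task5336-bwburst-abs.png")
```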

I believe the first few lines contain header info that explains the format of the CSV. One of the columns has a timestamp and another has the system memory usage. You should be able to draw a memory-over-time plot with those two columns and compare the branches in the same graph. (Note that this is total system memory usage, so this only works if nothing else is consuming memory on these machines - which should be the case if you used EC2.)

Okay, I attached a graph for system memory usage, too. All three branches were run in newly created EC2 instances. I can't spot any difference between the branches.

Also, can you attach the performance graphs for this set of runs?

I didn't make any performance graphs yet. Making them now. Will attach them once I have them.

Changed 7 years ago by karsten

Attachment: task5336-combined.pdf added

comment:23 in reply to:  22 Changed 7 years ago by karsten

Replying to karsten:

Replying to robgjansen:

Also, can you attach the performance graphs for this set of runs?

I didn't make any performance graphs yet. Making them now. Will attach them once I have them.

Attached.

comment:24 Changed 7 years ago by robgjansen

To sum things up:

  • it appears the nodes are actually obeying their bandwidth limits
  • there is no noticeable degradation in system performance (memory and CPU usage)
  • client performance wins out in taskb (the 10 MiB cap)

Did I miss something?

comment:25 in reply to:  24 Changed 7 years ago by karsten

Status: needs_review → needs_information

Replying to robgjansen:

To sum things up:

  • it appears the nodes are actually obeying their bandwidth limits
  • there is no noticeable degradation in system performance (memory and CPU usage)
  • client performance wins out in taskb (the 10 MiB cap)

Did I miss something?

Your conclusion looks about right. Does that mean we're done with this ticket and can close it?

comment:26 Changed 7 years ago by arma

Resolution: implemented
Status: needs_information → closed

We've definitely done some simulations.

I remain skeptical about the results though -- not because I think they're wrong, but because I don't think we have a good handle on what exactly is going on.

In particular, I wonder if further answers to #5398 would change our opinion here.

But this ticket does answer the "does it break or obviously go bad" question with a negative. Closing.
