Opened 7 months ago

Last modified 2 months ago

#33076 needs_review task

Graph consensus and vote information from Rob's experiments

Reported by: mikeperry Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Analysis Version:
Severity: Normal Keywords: metrics-team-roadmap-2020, sbws-roadmap
Cc: teor, gk, pastly Actual Points: 3
Parent ID: #33327 Points: 6
Reviewer: Sponsor:

Description (last modified by karsten)

This is a ticket for the work to graph the historical onionperf data from Rob's relay flooding experiment.

Some discussion threads:
https://lists.torproject.org/pipermail/tor-scaling/2019-December/000077.html
https://lists.torproject.org/pipermail/tor-scaling/2020-January/000081.html

Basically, we want to have a standard way to graph results from key metrics from before, during, and after the experiment.

In this case, we want CDF-TTFB, CDF-DL from onionperf results.

We also want CDF-Relay-Stream-Capacity and CDF-Relay-Utilization for the consensus, as well as from the votes, to see if the votes from TorFlow drastically differ from sbws during the experiment.

https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics

Update from June 10, 2020: We finished the CDF-TTFB and CDF-DL portions by adding these graphs to OnionPerf's visualize mode. The remaining parts are the CDF-Relay-* graphs that are based on consensuses and votes. Keep this in mind when reading comments up to June 10, 2020.

Child Tickets

Change History (41)

Changed 7 months ago by karsten

comment:1 Changed 7 months ago by karsten

Okay, let's continue this discussion on this ticket.

I'm attaching new graphs with CDF-TTFB for all OnionPerfs running during that time.

These graphs use colors from ColorBrewer that are supposed to be easier to distinguish for colorblind people, but we can still change that as we move forward.

CDF-DL requires some more processing, and I'm not yet sure how to do the other two. I'll see when I get there. I'll post updates as I have them.

comment:2 Changed 7 months ago by gaba

Keywords: metrics-team-roadmap-2020Q1 added

Changed 7 months ago by karsten

comment:3 Changed 7 months ago by karsten

This is a bit embarrassing, but the reason for the 50% bump was that I mixed public and onion server results... Fixed here! That file also contains CDF-DL. Will try to do the other two later tonight.

Changed 7 months ago by karsten

comment:4 Changed 7 months ago by karsten

The other two are about as hard as expected. I just finished a very first version of CDF-Relay-Utilization that I'm attaching here. Expect bugs!

I'll work on CDF-Relay-Stream-Capacity now, but only for consensuses for the moment; votes will have to wait.

comment:5 Changed 7 months ago by karsten

Status: new → needs_review

And here's another document with both CDF-Relay-* graphs; without votes. Expect bugs!

This might be a good point for you to provide feedback on whether this is roughly going in the right direction. Setting to needs_review for this purpose. I'll pause working on this until I hear back. Thanks!

comment:6 Changed 7 months ago by teor

Cc: teor added

comment:7 Changed 7 months ago by mikeperry

Status: needs_review → needs_revision

Hrmm. CDF-Relay-Capacity should have an X axis range of [0.0, 1.0]. I just realized that there was some incorrect wording in the definition of the metric on https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceMetrics. It should be the average read/write history divided by the peak observed bandwidth, for each relay in the network. In other words, you average the read/write history over time for a relay and divide it by the peak advertised bandwidth over that period of time. This should produce a value between 0 and 1.
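For example (hypothetical numbers): a relay whose read/write history averages 2 MB/s over the period, with a peak bandwidth of 8 MB/s in that same period, would plot at 2/8 = 0.25.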

There is also a bug in the CDF-Relay-Stream-Capacity, though I am not sure what it is. It should be centered around 1.0, not 0.01. Can you write the formula you used for this? Perhaps you just forgot to include the scale multiplier for the measured bandwidth?

comment:8 Changed 7 months ago by karsten

I'll have to be AFK for the next couple hours, but until then here's a tiny subset of the data I used for the CDF-Relay-* graphs:

fingerprint,validafter,hasexitflag,hasguardflag,read,written,rate,burst,observed,measured
000A10D4,2019-08-13T09:00,f,f,451,286,102400,204800,112448,2
000C1F7C,2019-08-13T09:00,f,t,3578899,3591896,19660800,1073741824,8023040,27200
000CFDEC,2019-08-13T09:00,f,f,1057727,1057703,4194304,6291456,4117504,4360
0011BD24,2019-08-13T09:00,t,t,17419119,17556018,1073741824,1073741824,34104843,44200
001524DD,2019-08-13T09:00,f,t,2065558,2119272,5242880,10485760,5455106,9870
0020D8A2,2019-08-13T09:00,t,t,3686378,3740564,1073741824,1073741824,9030923,11000
00342C0E,2019-08-13T09:00,f,t,1015702,1020588,4194304,5242880,3717941,4500
003BFA1B,2019-08-13T09:00,f,t,6623297,6664212,26214400,1073741824,18836759,29800
003D7882,2019-08-13T09:00,t,f,1280946,1304984,1073741824,1073741824,5083154,3640
0041E015,2019-08-13T09:00,f,f,175940,176294,10485760,10485760,1102303,994
00451DF7,2019-08-13T09:00,f,f,539348,539309,4194304,6291456,2820096,3100
004573FE,2019-08-13T09:00,f,f,15712,15637,819200,1024000,480169,243
0059D929,2019-08-13T09:00,f,f,82583,82628,409600,1048576,450570,309
005ED972,2019-08-13T09:00,f,f,8810311,8828818,1073741824,1073741824,22108491,41000
00723AF1,2019-08-13T09:00,f,t,123404,124200,1073741824,1073741824,21775327,988
00727F3A,2019-08-13T09:00,f,f,699,698,512000,1024000,240510,2
0077BCBA,2019-08-13T09:00,t,t,21706519,22054066,1073741824,1073741824,38740943,63000
0077DDDD,2019-08-13T09:00,f,f,1049,892,13107200,14417920,1969480,3
008E7B70,2019-08-13T09:00,f,t,3060672,3082137,1073741824,1073741824,9123191,9860
0095792C,2019-08-13T09:00,f,f,324405,324452,1048576,1228800,963827,1250
00962D2D,2019-08-13T09:00,f,f,427773,427954,1048576,2097152,1150132,1200
009851DF,2019-08-13T09:00,f,f,553,391,131072,524288,172355,40
00B3C9FB,2019-08-13T09:00,f,f,1673,1096,102400,204800,119949,18
00CCE6A8,2019-08-13T09:00,t,f,355872,375660,1073741824,1073741824,11663560,35
00D2269D,2019-08-13T09:00,f,t,4218721,4238104,19660800,26214400,9658368,14400
00E1649E,2019-08-13T09:00,f,t,8092131,8137499,13107200,13107200,12435058,22100
00E89DDE,2019-08-13T09:00,f,t,1847757,1854884,1073741824,1073741824,6900558,10200
00FBC7DB,2019-08-13T09:00,f,f,83514,19986,307200,614400,970523,172

There's one line per consensus entry, which is where we get the following columns from: fingerprint,validafter,hasexitflag,hasguardflag,measured. The read and write columns come from bandwidth histories contained in extra-info descriptors. rate,burst,observed come from the server descriptor referenced by the consensus entry.

For CDF-Relay-Utilization I used (pseudo code):

read_write := (read + written) / 2;
advertised := min(rate, burst, observed)
plot -> advertised / read_write

For CDF-Relay-Stream-Capacity I used:

plot -> measured / observed
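For illustration only, a minimal pandas sketch of those two ratios applied to a CSV file in the format of the excerpt above (the relays.csv file name is made up; this is not the actual analysis code):

import pandas as pd

# One row per consensus entry, as in the excerpt above.
df = pd.read_csv("relays.csv")

# CDF-Relay-Utilization, as plotted at this point: advertised bandwidth
# (minimum of rate, burst, observed) divided by the mean read/write history.
read_write = (df["read"] + df["written"]) / 2
advertised = df[["rate", "burst", "observed"]].min(axis=1)
utilization = advertised / read_write

# CDF-Relay-Stream-Capacity, as plotted at this point: measured / observed.
stream_capacity = df["measured"] / df["observed"]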

comment:9 in reply to:  8 Changed 7 months ago by mikeperry

Replying to karsten:

There's one line per consensus entry, which is where we get the following columns from: fingerprint,validafter,hasexitflag,hasguardflag,measured. The read and write columns come from bandwidth histories contained in extra-info descriptors. rate,burst,observed come from the server descriptor referenced by the consensus entry.

For CDF-Relay-Utilization I used (pseudo code):

read_write := (read + written) / 2;
advertised := min(rate, burst, observed)
plot -> advertised / read_write

Let's do:

 read_write := (read + written) / 2;
 peak := max(rate, burst, observed)
 plot -> read_write / peak

Ideally, peak would be that peak-in-30-days thing we ground out in Whistler, but for this we actually want to see what the instantaneous change to peak caused by the experiment did to the results.

The plotted values should still be between 0 and 1.0. Any relay that has a value over 1.0 would be very interesting to look at.

For CDF-Relay-Stream-Capacity I used:

plot -> measured / observed

Yes, I think this is just off by a factor of 1000 then. It should be:

plot -> 1000*measured / observed

And then the CDFs will be centered on 1.0.

comment:10 Changed 7 months ago by karsten

I attached a new set of graphs here. They are all cut off at percentile 95, and they all contain the plotted formula in the subtitle.

Regarding the max(rate, burst, observed) part, I'm worried that this number is not very meaningful. In theory, the operator can pick any numbers for rate and burst which the relay can never provide. I plotted one graph with that number, but I don't think we should use that.

The min(rate, burst, observed) number is what we typically use as advertised bandwidth. Maybe it's sufficient to ignore what the operator thought the relay could/should provide and look at observed bandwidth only. I included a plot for this, too.

I recall the peak advertised bandwidth thing we talked about in Whistler. It's significantly harder to compute than the current advertised (or observed) bandwidth, because we need to include lots of descriptors for that. We should pick a formula that we use for all experiments, not just for this one. Maybe we can start with the single value and leave it as a possible extension for the future to consider a moving window of 30 days.

comment:11 Changed 6 months ago by mikeperry

Ok I went and dug through the Tor and TorFlow source. For CDF-Relay-Stream-Capacity, we actually need:

  plot -> 1000*measured/min(rate, observed)

min(rate, observed) is what TorFlow uses to multiply out the measured bandwidth weights. It does not include the middle "burst" value in this minimum.

For CDF-Relay-Utilization, we definitely want:

 read_write := (read + written) / 2;
 peak := observed
 plot -> read_write / peak

I recall the peak advertised bandwidth thing we talked about in Whistler. It's significantly harder to compute than the current advertised (or observed) bandwidth, because we need to include lots of descriptors for that. We should pick a formula that we use for all experiments, not just for this one. Maybe we can start with the single value and leave it as a possible extension for the future to consider a moving window of 30 days.

Yeah... so we really need this 30 day peak of the observed value, as that gets us closer to the true network capacity and utilization. Rob's experiment is useful exactly because it forces relays closer to their peak capacity. Long term, I think metrics should be computing these 30 day maxes continually and providing them as an auxiliary csv or other data stream for graphs like these.

For this experiment, it is interesting to see the direct change from 5 day peak observed to Rob's new values, but if I had to pick only one graph, I would still prefer using 30 day peaks.
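For illustration, a minimal pandas sketch of these formulas against the CSV excerpt from comment 8 (the relays.csv file name is made up; this is not the actual analysis code):

import pandas as pd

df = pd.read_csv("relays.csv")

# CDF-Relay-Stream-Capacity: measured, times the scale factor of 1000
# discussed above, divided by min(rate, observed) as used by TorFlow.
stream_capacity = 1000 * df["measured"] / df[["rate", "observed"]].min(axis=1)

# CDF-Relay-Utilization: mean read/write history divided by the peak,
# here simply the observed value from the referenced server descriptor.
# (A 30 day rolling maximum of observed, as discussed above, would need
# descriptors from a much longer time window than this excerpt covers.)
read_write = (df["read"] + df["written"]) / 2
utilization = read_write / df["observed"]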

comment:12 Changed 6 months ago by karsten

Here are the new graphs, including the new formulas with peak observed bandwidths and including votes.

I can make smaller changes later today, but larger changes will have to wait until next week.

comment:13 Changed 6 months ago by mikeperry

Hrmm. The consensus graphs now all look good, but the votes are a real mystery. It almost looks like only 25% of relays are being measured in these votes?

Do 75% of relays in the network really have a 1.0 value for 1000 * measured / min_rate_observed (ie measured = min_rate_observed/1000)?

Or are the votes scaled differently, and we're just clipping because the upper end of the range is much greater than 1.0 for votes?

comment:14 Changed 6 months ago by gaba

Keywords: sbws-roadmap added

comment:15 Changed 6 months ago by gaba

Parent ID: #33121

The goal is to deploy sbws in all bw authorities. We need to fix critical bugs to do this.

comment:16 in reply to:  13 Changed 6 months ago by karsten

Status: needs_revision → needs_review

Replying to mikeperry:

Hrmm. The consensus graphs now all look good, but the votes are a real mystery. It almost looks like only 25% of relays are being measured in these votes?

Do 75% of relays in the network really have a 1.0 value for 1000 * measured / min_rate_observed (ie measured = min_rate_observed/1000)?

Or are the votes scaled differently, and we're just clipping because the upper end of the range is much greater than 1.0 for votes?

The issue was that I used the Bandwidth value from both consensuses and votes, where I should have used the Measured value from votes. That also explains why the graphs were the same for all authorities. Fixed here.

comment:17 Changed 6 months ago by gaba

Actual Points: 2

Mikeperry

comment:18 Changed 6 months ago by gaba

Mikeperry: Can we close this ticket (and if not, when)? How much more work will this need from the metrics team?

comment:19 Changed 6 months ago by mikeperry

Gaba: This ticket is part of the work for Objective 3.1 of the MOSS proposal. I need tooling to produce graphs like this on a regular basis. We need them for the sbws transition eval (#33077). After that, we will need to set about turning these into scripts or metrics portal pages that can be used to produce this set of graphs on the regular.

It might make more sense to make this ticket and #33077 child tickets of a ticket for Objective 3.1 rather than the sbws transition itself. Is there a ticket for Objective 3.1 yet?

At a glance, the graphs we need are:

  1. onionperf-cdf-ttfb-2020-01-28a (CDF-TTFB and CDF-DL, with 95-100% tail zoom)
  2. cdf-relay-utilization-and-stream-capacity-2020-02-04 (CDF-Relay-Utilization and CDF-Relay-Stream-Capacity, plus votes)

However, I am still puzzling out the differences when we used peak_observed (the 02-04 graphs) vs observed (01-29b graphs). I need to think on that a bit more and may have another graph request.

For #33077, I think it would be helpful to group the votes in some way based on which dirauths are using sbws vs torflow. Not sure if that means overlay all the CDFs or just combine them, or do some additional work to take the median values of measured and min_rate_observed from each set and make one graph for TorFlow and one for sbws votes. On the one hand, we will want to pick an option that is likely to be useful for future experiments. On the other, we will need to really dig into what the network load balancing would look like under each system.

So more iteration is needed. Since that vote analysis is specific to the sbws eval, that iteration can happen in #33077.

comment:20 Changed 6 months ago by karsten

Regarding Objective 3.1 of the MOSS proposal, I'm afraid that the CDF-Relay-* graphs will be out of scope. We had to reduce the budget and were only able to include graphs based on OnionPerf measurements, not based on other Tor descriptors. We can (and should) plan to include CDF-Relay-* graphs in that tool in the future, but the coding will likely have to wait until we have funding for that part.

To give you an idea why the CDF-Relay-* graphs are harder to add than the OnionPerf-based ones: we're parsing 1.9M of compressed OnionPerf data for the CDF-TTFB and CDF-DL graphs, but 841M of compressed tor descriptors for the CDF-Relay-* graphs. Making the last PDF I attached above kept my local workstation (1T NVMe, 64G RAM) busy for over an hour.

Regarding adding graphs to the metrics website, I don't think that we can add any CDFs there without making major changes to the underlying graphing engine. Again, to give an example, the uncompressed data that is graphed in the last PDF I attached has a size of 4.5G. This is way different from existing graphs on the metrics website. I think we're looking at a tool that developers will run locally with data downloaded from CollecTor, a local database, and enough CPU time and RAM available to chew on all the data.

This all shouldn't stop us from exploring possible graphs that we might need in the future. And I can make graphs like the last PDF for occasions like the torflow/sbws transition. They will just not be part of the tool that we'll have available once the MOSS proposal is done.

I'll move the torflow/sbws discussion over to #33077. If you have any requests for changing the graphs above, except for the votes graph, please comment on this ticket. And when you think we're done, please indicate that here, too. Thanks!

Changed 6 months ago by karsten

comment:21 Changed 6 months ago by karsten

Dennis brought up the question of whether failed measurements are included in the CDF-TTFB graph or not. They are not right now, but we might consider including them with TTFB=Inf. The result would be that the curve doesn't go to 100% when there are failed attempts. That would give us an idea of when to expect the first byte in x% of measurement attempts. However, a possible downside might be that different time periods are harder to compare when there was a higher rate of failures in one period. On the other hand, maybe that's useful information.

I attached two variants of CDF-TTFB, one containing all successful measurements as before and one containing all measurements including the failed ones. The difference is really small in this case, visible for example in the op-us onion graph. It might be more visible in other cases.

Leaving this here for discussion. This might be a non-standard way of using ECDFs and therefore harder to understand and possibly harder to make with matplotlib. But if there's agreement that it would be good to have, we should try to make these graphs.
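To make the idea concrete, here is a rough matplotlib sketch of such a non-standard ECDF (made-up numbers; failed attempts count toward the total but never appear on the curve, so it plateaus below 100%):

import numpy as np
import matplotlib.pyplot as plt

# TTFB values in seconds for successful measurements (made-up examples),
# plus a count of failed measurement attempts treated as TTFB = Inf.
ttfb = np.array([0.4, 0.6, 0.9, 1.3, 2.1])
n_failed = 2

values = np.sort(ttfb)
total = len(values) + n_failed
fractions = np.arange(1, len(values) + 1) / total

plt.step(values, fractions, where="post")
plt.ylim(0, 1)
plt.xlabel("Time to first byte (s)")
plt.ylabel("Cumulative fraction of measurement attempts")
plt.show()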

comment:22 Changed 6 months ago by karsten

Actual Points: 2 → 3
Points: 6

Estimating 5 points for making these graphs plus 1 point for cleaning up the code and providing it in metrics-tasks.git. Also estimating that we already spent 3 of these points for making and discussing attached graphs.

comment:23 Changed 6 months ago by dennis.jackson

Using Karsten's code and dataset, I took another look at the onionperf measurements from over the course of Rob's experiment.

Time to First Byte

This graph shows TTFB measurements over non-onion circuits from op-ab. The period in which the experiment was active is highlighted in red and the period in which the bandwidth values remained in the consensus is in orange. The lines show the 5th percentile, 50th percentile (median), the mean and the 95th percentile. Each point on the line is an average over the previous 24 hours of measurements.

https://dennisjj.co.uk/tor-bandwidth-experiment/ttfb_public_op-ab.png

The increase in congestion over the experimental window appears to be substantial, with the 95th percentile TTFB rising from under 2 seconds to over 10 seconds. The trend is much flatter for onion circuits.

If we look at the raw data:

https://dennisjj.co.uk/tor-bandwidth-experiment/raw/exit_op-ab.png

We see a band at the 10 second line. Karsten has already noted this in #31521, and his investigation suggests it happens when streams time out on one circuit and get reattached to another. I wonder if some exit nodes could not keep up with the additional bandwidth, leading to timeouts?

Bandwidth Measurements

I have concerns about the reliability of the current dataset for the DL graphs. It contains only 2000 measurements over the month, and none of the DL measurements appear to have finished successfully (i.e., have a non-null value for getDataPercentiles().get(100)). However, I might be misunderstanding OnionPerf's output format.

Further Thoughts

  • op-nl and op-usa reported no measurements between the 14th and the 16th of August, which makes it hard to use their data for this experiment.
  • There are a total of 10k ttfb measurements for the month of August. When you consider how many guard/middle/exit combinations there are, this really isn't that much data to work with. For example, it would have been interesting to drill down to how TTFB changed on the exit nodes whose weighting changed the most during this experiment, but there just aren't enough samples.
  • A full set of graphs, Jupyter Notebook with Python code and csv of the raw data: https://drive.google.com/open?id=1q1JRP5RdPEhQcDddh7KEST1aADMoiyt_
Last edited 6 months ago by dennis.jackson

comment:24 Changed 6 months ago by karsten

Thanks, dennis_jackson, for the great input!

I like your percentiles graph with the moving 24 hour window. We should include that graph type in our candidate list for graphs to be added to OnionPerf's visualization mode. Is that moving 24 hour window a standard visualization, or did you further process the data I gave you?

Regarding the dataset behind bandwidth measurements, I wonder if we should kill the 50 KiB downloads in deployed OnionPerfs and only keep the 1 MiB and 5 MiB downloads. If we later think that we need time-to-50KiB, we can always obtain that from the tgen logs. The main change would be that OnionPerfs consume more bandwidth and also put more load on the Tor network. The effect for graphs like these would be that we'd have 5 times as many measurements.

But I think (and hope) that you're wrong about measurements not having finished. If DATAPERC100 is non-null that actually means that the measurement reached the point where it received 100% of expected bytes. See also the Torperf and OnionPerf Measurement Results data format description.

It's quite possible that op-nl and op-us did not report measurements during the stated days. We have a reliability problem with the deployed OnionPerfs, which is why we included work on better notifications and easier deployment in our funding proposal. But we should also keep in mind that the main purpose of the currently deployed OnionPerfs is to have a baseline over the years. If we're planning experiments like this in the future we might want to spin up a couple OnionPerfs and watch them much more closely for a week or two.

Are you sure about that 10k ttfb measurements number for the month of August? In theory, every OnionPerf instance should make a new measurement every 5 minutes. That's 12*24*31 = 8928 measurements per instance in August, or 8928*4 = 35712 measurements performed by all four instances in August. So, okay, not quite 10k, but also not that many more. We should spin up more OnionPerf instances as soon as it has become easier to operate them. What's a good number to keep running continuously, in your opinion? 10? 20? And maybe we should consider deploying more than 1 instance per host or data center, so that we have more measurements with comparable network properties.

To summarize, we have a new candidate visualization, a best practice to set up additional OnionPerfs when running experiments, and suggestions to kill 50 KiB measurements and to deploy more OnionPerf instances. Does this make sense?

comment:25 in reply to:  24 Changed 6 months ago by dennis.jackson

Replying to karsten:

24 Hour Moving Average

I like your percentiles graph with the moving 24 hour window. We should include that graph type in our candidate list for graphs to be added to OnionPerf's visualization mode. Is that moving 24 hour window a standard visualization, or did you further process the data I gave you?

At a high level: I'm loading the data into Pandas and then using the rolling function to compute statistics for a window. It's pretty flexible and supports different weighting strategies for the window, but I used 'uniform' here. The code is contained in the Python notebook I linked at the end of my post.

Excerpt:

time_period = 60*60*24
threshold = 10
p95 = lambda x : x.rolling(f'{time_period}s',min_periods=threshold).dl.quantile(0.95)

The resulting data can be plotted as a time series in your graphing library of choice :).
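A slightly fuller, self-contained version of that idea (illustrative only, with synthetic data standing in for the OnionPerf measurements):

import numpy as np
import pandas as pd

# Synthetic stand-in: one "dl" value every five minutes for a few days,
# indexed by measurement time.
index = pd.date_range("2019-08-04", "2019-08-10", freq="5min")
df = pd.DataFrame({"dl": np.random.lognormal(0, 0.5, len(index))}, index=index)

time_period = 60 * 60 * 24   # 24 hour window, in seconds
threshold = 10               # require at least 10 measurements per window

rolling = df["dl"].rolling(f"{time_period}s", min_periods=threshold)
summary = pd.concat(
    {"p05": rolling.quantile(0.05),
     "median": rolling.quantile(0.50),
     "mean": rolling.mean(),
     "p95": rolling.quantile(0.95)},
    axis=1)
summary.plot()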

Measuring Latency

Regarding the dataset behind bandwidth measurements, I wonder if we should kill the 50 KiB downloads in deployed OnionPerfs and only keep the 1 MiB and 5 MiB downloads. If we later think that we need time-to-50KiB, we can always obtain that from the tgen logs. The main change would be that OnionPerfs consume more bandwidth and also put more load on the Tor network. The effect for graphs like these would be that we'd have 5 times as many measurements.

I think that is definitely worth thinking about, as 50 KB does seem too small to infer anything about bandwidth. It is maybe worth considering the cost of circuit construction, though. For example, if we open a circuit for a latency measurement, we could use Arthur's strategy of fetching HEAD only, and maybe it is worth using that circuit for a series of measurements over a couple of minutes, which would give us more reliable "point in time" data without any additional circuit construction overhead.

August Measurement Success Rate

But I think (and hope) that you're wrong about measurements not having finished. If DATAPERC100 is non-null that actually means that the measurement reached the point where it received 100% of expected bytes. See also the Torperf and OnionPerf Measurement Results data format description.

You are quite right! I looked back at my code and whilst I was correctly checking DATAPERC100 is non-null to imply success, I also found a trailing } which captured my check in the wrong if clause. My bad! Rerunning with the fix shows only 29 measurements failed to finish in August. Much much healthier!

Number of Measurements in August

Are you sure about that 10k ttfb measurements number for the month of August? In theory, every OnionPerf instance should make a new measurement every 5 minutes. That's 12*24*31 = 8928 measurements per instance in August, or 8928*4 = 35712 measurements performed by all four instances in August. So, okay, not quite 10k, but also not that many more. We should spin up more OnionPerf instances as soon as it has become easier to operate them.

Sorry, this was sloppy and incorrect wording on my part: "month of August" -> "Experimental period from August 4th - August 19th". There are 15k attempted measurements in this window, however op-hk did not achieve any successful connections and consequently only ~10k successful measurements in the dataset.

How many is enough?

What's a good number to keep running continuously, in your opinion? 10? 20? And maybe we should consider deploying more than 1 instance per host or data center, so that we have more measurements with comparable network properties.

I think it would be worth pulling in Mike (congestion related) and the network health team (#33178) and thinking about this in terms of output statistics rather than measurement inputs. A possible example:

  • For a given X {minute,hour,day} period, we want to measure, for {any circuit, circuits using this guard, circuits using this exit}, the {probability of timeout, p5-p50-p95 latency, p5-p50-p95 bandwidth} with a 90% confidence interval less than {1%, 500ms, 500 KB/s}

This gives us a rolling target in terms of measurements we want to make, varying with network conditions and how fine-grained we would like the statistics to be for a given time period. We could estimate the number of samples required (using the existing datasets) for each of these statistics, put in the cost per measurement, and work out what is feasible for long term monitoring and short term experiments.
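As a rough sketch of that kind of back-of-the-envelope calculation (purely hypothetical numbers, with a simple bootstrap standing in for a proper power analysis):

import numpy as np

# Hypothetical stand-in for existing TTFB samples (seconds); in practice
# these would come from the OnionPerf dataset for the period of interest.
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=0.0, sigma=0.6, size=2000)

def ci_width_of_p95(n, n_boot=2000):
    """Width of the 90% confidence interval of the 95th percentile latency
    when estimated from n measurements, via bootstrap resampling."""
    estimates = [np.percentile(rng.choice(samples, n, replace=True), 95)
                 for _ in range(n_boot)]
    lo, hi = np.percentile(estimates, [5, 95])
    return hi - lo

# How many measurements per window would we need before the interval
# narrows to an acceptable width?
for n in (100, 500, 1000, 5000):
    print(n, round(ci_width_of_p95(n), 3))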

comment:26 Changed 6 months ago by gaba

Parent ID: #33121 → #33327

Moving it to other project for tracking.

comment:27 Changed 6 months ago by gk

Cc: gk added

comment:28 Changed 6 months ago by gaba

Sponsor: Sponsor59

comment:29 Changed 4 months ago by pastly

Cc: pastly added

comment:30 Changed 3 months ago by karsten

Sponsor: Sponsor59 → Sponsor59-must

Moving this to Sponsor59-must, because these graphs have been important prerequisites for working on the other Sponsor59 tasks.

comment:31 Changed 2 months ago by gaba

Keywords: metrics-team-roadmap-2020 added; metrics-team-roadmap-2020Q1 removed

Changed 2 months ago by karsten

comment:32 Changed 2 months ago by karsten

Description: modified (diff)
Sponsor: Sponsor59-must (removed)
Summary: Graph onionperf and consensus information from Rob's experiments → Graph consensus and vote information from Rob's experiments

This is quite the ticket so far, with lots of attachments and comments. Time to join the various threads and summarize what's left to do:

The first part is CDF-TTFB and CDF-DL graphs:

  • OnionPerf's visualize mode is very soon going to support CDF-TTFB and CDF-DL graphs. The only remaining piece is the #33257 review, but I don't expect major changes there except for maybe bug fixes. I attached the output of op-ab's measurements before/during/after the experiment discussed in this ticket.
  • Pages 1 and 2 show CDF-TTFB for the public/onion service cases. Note that, in theory, I could have added 6 more lines to these graphs by adding 6 more data sets to the visualize command. It's unclear how readable the graph would have been, so I decided against it.
  • Pages 32 and 33 show CDF-DL for the public/onion service cases.
  • We briefly discussed Dennis' question whether failed measurements are included in the CDF-TTFB graph or not. They are not, and even though it would be possible to include them in the CDF-TTFB graphs as Inf values and in the CDF-DL graphs as -Inf values, I'm not convinced that it's a good idea. If failure rates differ a lot between the data sets we'll see that in the error graphs. If there's disagreement about this case, I'd like us to create a new ticket and discuss this topic there.
  • Dennis added a cool graph showing op-ab's TTFB over time as rolling 24 hour values. It showed quite well how the 95th percentile grows to over 10 seconds during the experiment and drops to much smaller values after the experiment. However, the same thing can be seen in the scatter plot on page 3 of the PDF I just attached. It wouldn't be hard to add another graph like Dennis', also because we're now using pandas just like Dennis did for this graph. I'm just not sure whether it's worth the additional effort. I'd say if somebody wants to have this graph and ideally provide a patch, let's open a new ticket for that enhancement.
  • This concludes the work on the CDF-TTFB and CDF-DL graphs. If we need more graphs containing OnionPerf measurement data, let's open new tickets for them.

The other remaining part is CDF-Relay-* graphs:

  • We don't have these graphs in OnionPerf, because we'll need Tor directory data in order to make them, and that's not available in OnionPerf yet. We also said that these graphs are out of scope for Sponsor 59.
  • Mike said that he's still puzzling out the differences when we used peak_observed (the 02-04 graphs) vs observed (01-29b graphs). He said he needs to think on that a bit more and may have another graph request.
  • I'm changing this ticket to discuss these remaining graphs. This includes updating the summary and description and removing the sponsor tag. I'll leave it in needs_review for mikeperry to comment on the difference mentioned in the previous bullet point. I could as well have opened a new ticket, but that would have meant carrying over a lot of context from this ticket, and that seemed like a lot of work.

If anything else remains to be done, please comment here or open a new ticket for that. Thanks, everyone!
