Opened 9 months ago

Closed 6 months ago

#29772 closed enhancement (fixed)

Add bandwidth graph with median, quartiles, and lowest bandwidth within 1.5 IQR of lower quartile

Reported by: karsten
Owned by: metrics-team
Priority: High
Milestone:
Component: Metrics/Website
Version:
Severity: Normal
Keywords: scalability
Cc: metrics-team
Actual Points:
Parent ID: #29507
Points:
Reviewer: irl
Sponsor:

Description

We have been asked to add graphs on (nearly) worst-case performance of our OnionPerf measurements, in addition to the average-case performance graphs we already have. In particular, we were asked to plot latency and bandwidth numbers. This ticket is about bandwidth numbers. It's based on team-internal discussions in Brussels and follow-up conversations.

With OnionPerf we measure download times for 50 KiB/1 MiB/5 MiB files that we download from our own public web server or onion server. We could use our DATAPERC* timestamps to extract how long it takes to download a specific part of our files and use that to compute average bandwidth.

We'd like to exclude the transfer start with all the circuit establishment and TCP slow start effects and focus only on the part where things have stabilized. More precisely, we could look at the 5 MiB downloads and consider only the time between finishing the first 2.5 MiB and finishing the full 5 MiB. Or we could look at the time between reaching 0.5 MiB and 1 MiB, which we have data for from both our 1 MiB and 5 MiB downloads.
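
As a rough illustration of this computation (the parameter names are made up and may not match OnionPerf's actual field names), a sketch in Python for the second half of a 5 MiB download could look like this:

  # Sketch: throughput for the second half of a 5 MiB download, assuming we
  # have Unix timestamps for reaching 50% and 100% of the transfer (e.g. from
  # DATAPERC50 and DATAPERC100). Names here are illustrative only.
  def second_half_throughput_kbps(dataperc50, dataperc100,
                                  filesize_bytes=5 * 1024 * 1024):
      """Return throughput in kbit/s for the second half of the download."""
      elapsed = dataperc100 - dataperc50            # seconds spent on the last 50%
      transferred_bits = (filesize_bytes / 2) * 8   # bits in the second half
      return transferred_bits / elapsed / 1000      # kbit/s

  # Example: 2.5 MiB downloaded in 4.2 seconds -> roughly 4,993 kbit/s
  print(second_half_throughput_kbps(1000.0, 1004.2))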

The ask was to plot nearly worst-case bandwidth. My guess is that we shouldn't plot the minimum, because then we'd only be looking at outliers, but rather the 1st, 5th, or 10th percentile. Let's maybe start with the 1st percentile.
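
For illustration only (the "date" and "kbps" column names are made up and not our actual database schema), such a daily percentile could be computed like this:

  # Sketch: daily 1st-percentile throughput, assuming a pandas DataFrame with
  # one row per successful download and hypothetical "date" and "kbps" columns.
  import pandas as pd

  def daily_low_percentile(df, q=0.01):
      return df.groupby("date")["kbps"].quantile(q)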

I'm attaching two graphs for the public server case and the onion server case. They both show the respective 1st percentile bandwidth of successful 1 MiB and 5 MiB downloads on a given day.

The coding and deployment effort for bringing this graph to the Tor Metrics website would be comparatively small, because we already have all the required data in the database. However, I'm not attaching a patch yet, because I'd first like to discuss the general idea of having such a graph.

Child Tickets

Attachments (4)

onionperf-bandwidth-public.png (172.1 KB) - added by karsten 9 months ago.
onionperf-bandwidth-onion.png (181.7 KB) - added by karsten 9 months ago.
onionperf-boxplots-annotated-2019-04-17.pdf (770.9 KB) - added by karsten 8 months ago.
onionperf-bandwidth-public-2019-05-25.png (287.6 KB) - added by karsten 7 months ago.


Change History (22)

Changed 9 months ago by karsten

Changed 9 months ago by karsten

comment:1 Changed 9 months ago by karsten

[attachments: onionperf-bandwidth-public.png and onionperf-bandwidth-onion.png]

comment:2 Changed 8 months ago by karsten

Status: new → needs_review

comment:3 Changed 8 months ago by gaba

Keywords: scalability added

comment:4 Changed 8 months ago by irl

Status: needs_review → needs_revision

I'm not sure that the 1st percentile is the right way to do this. Can we instead exclude minor/major outliers, i.e. values that are slower than the 1st quartile minus 1.5x/3x the interquartile range, and then take the minimum? How does this change the way the plot looks?
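
For reference, a minimal sketch of that computation, assuming a plain list of one day's throughput values in kbit/s (1.5x for minor, 3x for major outliers, as described above):

  # Sketch: drop values below the chosen outlier fence and take the minimum
  # of what remains.
  import numpy as np

  def lowest_non_outlier(values, fence=1.5):
      values = np.asarray(values, dtype=float)
      q1, q3 = np.percentile(values, [25, 75])
      iqr = q3 - q1
      cutoff = q1 - fence * iqr   # fence=1.5 excludes minor, 3.0 excludes major outliers
      return values[values >= cutoff].min()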

I don't think we should make a public graph on Tor Metrics for it, but can you also do a box plot for a month of measurements so I can understand just how variable the results are? I don't think I've done that before.

Changed 8 months ago by karsten

comment:5 Changed 8 months ago by karsten

Status: needs_revision → needs_review

Great idea! I made some new graphs and annotated them by hand. Please find them attached. And please take another look. Thanks!

comment:6 Changed 8 months ago by irl

Status: needs_review → needs_revision

For the bandwidths, I think that plotting the minimum of the minor outliers is OK; we're not excluding many measurements there, and we can see what the nearly-worst case is.

I think for latency, we have to accept that Tor as it currently operates is going to have wildly varying latency depending on the path you choose. There is currently no way of selecting a "low-latency" path and as we increase relay diversity we're going to see these latencies go up. In a way, higher latency may indicate greater network diversity.

For bandwidth, choosing one high-bandwidth server compared to another isn't going to affect the measurement result. When the network stays the same, we are likely to choose a similar set of relays throughout the day (or at least, the same consensus weight distribution). For latency, there is no such consideration, so we could be picking relays all over the place, or could be picking them all close together.

As for the absolute worst cases, though, I think we are not hitting them that often. Plotting the latency per day may be the wrong approach, because we may not have enough data points to accurately portray the true latency users experience. Perhaps we need a 4-day moving average, at which point some of those major outliers become minor outliers and we can plot the maximum minor outliers.
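
A rough sketch of that idea (the data layout is an assumption, not how the metrics pipeline actually stores these values): pool each day's raw latency values with the previous three days before classifying outliers, then take the largest value that is not a major outlier.

  # Sketch: 4-day sliding window over raw latency values. "per_day" maps a
  # date to that day's list of latency measurements (hypothetical layout).
  import numpy as np

  def windowed_nearly_worst_latency(per_day, dates, window=4):
      result = {}
      for i, day in enumerate(dates):
          pool = np.array([v for d in dates[max(0, i - window + 1):i + 1]
                           for v in per_day[d]], dtype=float)
          q1, q3 = np.percentile(pool, [25, 75])
          cutoff = q3 + 3.0 * (q3 - q1)       # major-outlier fence above upper quartile
          result[day] = pool[pool <= cutoff].max()
      return result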

As a data point, when we see a 2000ms latency, that is long enough to get a packet through optical fiber from the earth to the moon (not including the time to run the fiber, probably with special-purpose rockets). There might be some old routing/switching equipment near relays that is causing an impact here, because this can't just be distance between relays.

comment:7 Changed 8 months ago by karsten

Status: needs_revision → needs_review

I should start this comment by saying that I'm not a statistician. In case of doubt about anything I'm saying below, please re-read this first sentence! :)

I agree with you that the bandwidth plot works better than the latency plot. We're excluding very few bandwidth values as outliers, compared to the number of latency values that we're throwing out.

However, I don't think that a 4-day moving average would fix this. As you can see in the boxplots I posted here last week, medians and quartiles are relatively stable from day to day, and those values are what we use to decide whether another value is excluded as an outlier. After all, we have around 144 latency values per day and per public/onion service. So, even if we considered 4 days (or even more) at a time, our threshold for excluding values as outliers would not change much. Of course, implementing such a moving average wouldn't be trivial, given all the missing data we have to handle.

I think the issue is that the way we're excluding outliers assumes that our data is normally distributed. This works okay for bandwidth, which is obviously not exactly normal, because there's no negative bandwidth, but which is apparently close enough. It doesn't work very well for latencies, because some heavy-tailed distribution that we don't know is at work, and not all the values we're excluding are really outliers.

Another reason could be that we're looking at the smallest bandwidth values, which are at the head of the distribution, and at the largest latency values, which are in the heavy tail.

However, my suggestion is to ignore all this and make the plots as you suggested earlier and as I plotted them last week. Two reasons:

  1. Boxplots are understood by many people, and if we say that we're plotting the five values from boxplots, people will understand what we're doing.
  2. We need a baseline, even if it's not 100% correct in a mathematical/statistical sense. If our way to exclude outliers is flawed, it will be flawed for past measurements as well as for future measurements, in the exact same way.

Regarding your rocket analogy: it's certainly not just the distance between relays that we're seeing here. We're also seeing overfull queues keeping received cells waiting for crypto and forwarding to the next relay. But this is fine: we want to know how long it takes to send something over the circuit and get back a response.

So, my suggestion would be to move forward with what we have. What do you think?

comment:8 Changed 8 months ago by irl

Status: needs_review → new

I think for now, moving forward with what we have is OK.

We might want to add a note to the description when we add this graph that it is an experimental one, subject to change, or even removal, without notice. I don't know how many plots we might be asked to make, but we shouldn't be introducing them all with the same change procedures as we have for our other graphs just yet.

comment:9 in reply to:  8 Changed 8 months ago by karsten

Replying to irl:

    I think for now, moving forward with what we have is OK.

Great!

    We might want to add a note to the description when we add this graph that it is an experimental one, subject to change, or even removal, without notice. I don't know how many plots we might be asked to make, but we shouldn't be introducing them all with the same change procedures as we have for our other graphs just yet.

Good point. We don't have good procedures in place for graphs. We do have a policy of announcing changes to the per-graph CSV files two weeks in advance. Having something like that, plus a good plan for archiving old graphs, would be good. Will think more about this!

comment:10 Changed 7 months ago by karsten

Parent ID: #29507

Adding this new graph is part of the larger task to evaluate existing OnionPerf data regarding worst-case performance.

comment:11 Changed 7 months ago by karsten

Summary: Plot nearly worst-case bandwidth when downloading from [public|onion] server → Add bandwidth graph with median, quartiles, and lowest bandwidth within 1.5 IQR of lower quartile

Updating the summary based on the discussion above to reflect our plan.

I could imagine plotting the lowest bandwidth within 1.5 IQR of the lower quartile as another solid line that happens to fall outside of the IQR ribbon. Or maybe it has to be a dashed or dotted line, or a solid line in a "lighter" color. This is mostly a note to myself. I'll add sample graphs for discussion as soon as I have them.
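
To make this note more concrete, here is a sketch of the per-day values such a graph would need, again assuming a hypothetical table with "date" and "kbps" columns rather than our actual database schema:

  # Sketch: per-day summary values for the planned graph: median, quartiles,
  # and the lowest value within 1.5x IQR of the lower quartile.
  import pandas as pd

  def daily_summary(df):
      def summarize(group):
          kbps = group["kbps"]
          q1, md, q3 = kbps.quantile([0.25, 0.5, 0.75])
          low = kbps[kbps >= q1 - 1.5 * (q3 - q1)].min()
          return pd.Series({"low": low, "q1": q1, "md": md, "q3": q3})
      return df.groupby("date").apply(summarize)

The "low" column would then become the additional line below the IQR ribbon.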

Changed 7 months ago by karsten

comment:12 Changed 7 months ago by karsten

Status: new → needs_review

Here's a new graph based on my idea above:

[attachment: onionperf-bandwidth-public-2019-05-25.png]

Please take a look!

comment:13 Changed 7 months ago by irl

Status: needs_review → needs_revision

The graph looks good, and with a bit of text I think it can be easily understood. I really like this boxplot-style approach; it gives a very good overview of the measurements.

comment:14 Changed 7 months ago by irl

(needs_revision for the patch, or you can set back to new)

comment:15 Changed 6 months ago by karsten

Priority: Medium → High
Reviewer: irl
Status: needs_revision → needs_review

While preparing the patch I stumbled across the question of what we want to call this graph. The current name is "Bandwidths", which seems rather generic, in particular if it's supposed to uniquely identify this graph among all other Tor Metrics graphs. How about we call it:

  • "Circuit bandwidths",
  • "Average bandwidths",
  • "Measured bandwidths",
  • a combination of any of these, or
  • something even better?

The other graph names in the Performance category are:

  • "Time to download files over Tor",
  • "Timeouts and failures of downloading files over Tor",
  • "Circuit build times", and
  • "Circuit round-trip latencies".

Please review commit 508258f in my task-29772 branch.

Setting priority to high to make it into tomorrow's Review Monday. Thanks!!

comment:16 Changed 6 months ago by irl

Status: needs_review → needs_revision

Bandwidth might be the wrong term to use altogether; perhaps this is "throughput" or even "goodput" to be more precise, though "throughput" is probably more widely understood. To make it generally understood we could even call it "speed test".

The patch looks good except for the descriptions, where I think we should talk about throughput rather than bandwidth. A poorly tuned implementation may have a ton of bandwidth available but never allow you to take advantage of it.

I think "Speed test" might make the graph really accessible because many people have run a speed test at some point to check out their own Internet connection.

comment:17 Changed 6 months ago by karsten

Status: needs_revision → merge_ready

Sounds good. I changed the wording from "bandwidth" to "throughput", as that was the term that made most sense to me, concluding from your comment above that it would work for you, too. I merged the updated patch to master and deployed the first part, which produces a new .csv file once per day. As soon as that's available tomorrow evening, I'll update the website and make the graph available. Thanks!

comment:18 Changed 6 months ago by karsten

Resolution: fixed
Status: merge_ready → closed

The update run succeeded without issues, and I deployed the website changes this morning. Closing. Thanks again!
