Opened 8 years ago

Closed 5 years ago

#3260 closed task (fixed)

Learn client speed trends by evaluating directory request download times

Reported by: karsten
Owned by: karsten
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity:
Keywords: performance bootstrap
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The directory request statistics that we have collected since mid-2009 contain data that we haven't evaluated yet: network status download times. We also know the network status sizes from the directory archives, so we could infer client connection speeds. There may also be other things we could do with the data.
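As a rough illustration of the inference described here, a client's speed is the network status size divided by the download time. The function name and the numbers below are made up for the example; this is a sketch of the arithmetic, not code from the actual analysis.

```python
def client_speed_kib_per_s(status_size_bytes: int, download_seconds: float) -> float:
    """Estimate client speed in KiB/s from one network status download.

    Hypothetical helper: size comes from the directory archives,
    time from the directory request statistics.
    """
    if download_seconds <= 0:
        raise ValueError("download time must be positive")
    return status_size_bytes / download_seconds / 1024.0


# Illustrative numbers only: a 512 KiB status downloaded in 4 seconds.
print(client_speed_kib_per_s(512 * 1024, 4.0))
```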

Child Tickets

Attachments (2)

client-speed-trends.csv (93.0 KB) - added by karsten 7 years ago.
client-speed-trends.png (92.1 KB) - added by karsten 7 years ago.


Change History (9)

comment:1 Changed 8 years ago by arma

Good idea, especially to see if we can find trends on client speeds. I'd suggest the next step is to figure out why it's too hard for somebody else to grab the data and do the analysis. Learning that lesson might help us get more analysts on board.

comment:2 Changed 8 years ago by arma

Priority: minor → normal

comment:3 Changed 8 years ago by arma

Summary: Evaluate directory request download times → Learn client speed trends by evaluating directory request download times

comment:4 Changed 8 years ago by arma

Keywords: performance bootstrap added

Changed 7 years ago by karsten

Attachment: client-speed-trends.csv added

Changed 7 years ago by karsten

Attachment: client-speed-trends.png added

comment:5 Changed 7 years ago by karsten

Owner: set to karsten
Status: new → assigned

Replying to arma:

Good idea, especially to see if we can find trends on client speeds.

Here we go.

Directory mirrors learn client bandwidths from measuring the time that clients take to download a network status consensus. Directory mirrors report various percentiles (min, max, median, quartiles, deciles) of client bandwidths in dirreq-v3-tunneled-dl lines in their extra-info descriptors. We can aggregate these percentiles and come up with percentiles of the overall client bandwidth in the network.

I looked at the statistics reported by directory mirrors since late 2009. Back then, only a few relays reported dirreq statistics, so we have rather low-quality data until late 2010, when we explicitly asked operators to turn on statistics. In 2011, statistics were enabled by default, so we have had really good data since late 2011. Despite the rather low data quality in 2009 and 2010, we should be able to observe basic trends.

While aggregating statistics, I discarded 50% of statistics lines on a given date by number of reported completed downloads. The idea is to exclude statistics from directory mirrors which didn't have enough bandwidth themselves to serve fast clients. I also discarded dates with fewer than 10 directory mirrors reporting dirreq-v3-tunneled-dl lines.
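The filtering can be sketched roughly as follows. This is an illustration under assumptions, not the code used for the graph: it assumes the discarded half is the one with the fewest completed downloads, that the 10-mirror threshold is checked before discarding, and that the network-wide value is the median of the per-mirror values. The function name and the input layout (one dict per mirror with "complete" and percentile keys such as "md", in B/s) are hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional

MIN_MIRRORS = 10  # dates with fewer reporting mirrors are skipped


def aggregate_day(reports: List[Dict[str, int]], key: str = "md") -> Optional[float]:
    """Aggregate one date's per-mirror percentile `key` into a single KiB/s value.

    Returns None if too few mirrors reported on that date.
    """
    usable = [r for r in reports if key in r and "complete" in r]
    if len(usable) < MIN_MIRRORS:
        return None
    # ASSUMPTION: keep the half of the lines with the most completed downloads.
    usable.sort(key=lambda r: r["complete"], reverse=True)
    kept = usable[: (len(usable) + 1) // 2]
    # ASSUMPTION: combine per-mirror values by taking their median.
    return median([r[key] for r in kept]) / 1024.0


# Made-up input: 12 mirrors, mirror c reports c completed downloads and a median of c KiB/s.
reports = [{"complete": c, "md": c * 1024} for c in range(1, 13)]
print(aggregate_day(reports))      # aggregated median over the kept half
print(aggregate_day(reports[:9]))  # fewer than 10 mirrors, so None
```

Passing key="d1" instead of "md" would aggregate the 10th percentile in the same way.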

Here's the result (attachment client-speed-trends.png):


The median client bandwidth was roughly between 100 and 150 KiB/s in the past 2.5 years with no clear long-term trend upwards or downwards. The 10th percentile was roughly between 10 and 25 KiB/s.

(Note that colours are less visible in the left half of the graph on purpose. In theory, the data basis there is too thin to be plotted at all. I left the lines in to show that there's no dramatic trend change, but other than that we shouldn't conclude too much from the data before the end of 2010, or even better, before late 2011.)

The raw data behind the graph and the parsing and graphing code are available, too.

comment:6 in reply to: 1 Changed 7 years ago by karsten

Replying to arma:

I'd suggest the next step is to figure out why it's too hard for somebody else to grab the data and do the analysis. Learning that lesson might help us get more analysts on board.

A fine question. The analysis question (this ticket) was on Trac for 10 months without anybody picking it up. The analysis above is based on data specified in dir-spec.txt, explained on the Formats page, and publicly available on the Data page. My parser class has 150 rather quickly written lines of Java code and uses the Java metrics descriptor library to do the parsing; a simple Python program would have worked, too. The plotting code is standard R and ggplot2; this could have been done with gnuplot et al., too. Coding, testing, and analyzing took me 7 hours; parsing took 5 hours and maxed out I/O on my Core 2 Duo with 8 GB RAM.

So, what do you think is the resource that's least accessible to potential analysts?

comment:7 Changed 5 years ago by karsten

Resolution: fixed
Status: assigned → closed

Oh, hey, this ticket has results, and it hasn't seen discussion in over 2 years. Time to close it.
