Tweak Torperf's .mergedata format and make it the new default
Right now, we have three Torperf data formats: the .data files containing the output of trivsocks-client.c, the .extradata files containing the output of the Python script attached to Tor's control port, and the .mergedata files containing the consolidation of the two formats.
I'd like to tweak the .mergedata format to make it easier to process, and I want to make it the new default Torperf output format.
Here's what I'd like to change:
- Every data point in the new .mergedata format should contain the meta data that is necessary to generate Torperf graphs. This meta data contains the file size, the source (moria, siv, ferrinii, etc.), and possibly a custom guard choice and/or custom circuit build timeout. I could imagine adding these meta data as FILESIZE=51200, SOURCE=ferrinii, GUARDS=slowratio, CBT=75. One motivation for this change is to remove the dependency on the filename, which is how we currently encode these meta data, e.g., slowratio75cbt-50kb.mergedata.
Also, I'd like to be able to concatenate multiple Torperf files and have a single file for a) the standard Torperf runs of a given month and b) the Torperf runs from a given experiment. This makes it easier for people to download and process our Torperf data.
- We should combine the SEC and USEC fields and simply write timestamps as floats with a precision of, say, two decimal places, like we do in LAUNCH=1302523261.18. For example, STARTSEC=1302523501 STARTUSEC=703442 would become START=1302523501.70. This saves a lot of bytes and maybe even a few CPU cycles when parsing the single fields of a data point.
- When measuring hidden service performance as in #1944 (closed), we should add custom fields for the various hidden service substeps, e.g., START_RENDCIRC, GOT_INTROCIRC, etc.
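To make the first two changes concrete, here is a minimal Python sketch of how one old-style data point could be rewritten. It is only an illustration, not the actual consolidate_stats.py code: the function names are made up, and it assumes that fields are space-separated KEY=VALUE tokens and that the meta data fields are simply prepended to each line.

```python
def parse_fields(line):
    """Split a space-separated KEY=VALUE line into an ordered dict."""
    fields = {}
    for token in line.split():
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


def merge_timestamp(sec, usec):
    """Combine seconds and microseconds into a string with two decimals.

    Integer arithmetic is used so that rounding does not depend on
    floating-point representation.
    """
    centis = int(sec) * 100 + (int(usec) + 5000) // 10000
    return "%d.%02d" % (centis // 100, centis % 100)


def convert_data_point(line, metadata):
    """Rewrite one old-style data point (hypothetical helper).

    Every FOOSEC/FOOUSEC pair becomes a single FOO field, and the given
    meta data fields are written in front of the measurement fields.
    """
    old = parse_fields(line)
    new = dict(metadata)
    for key, value in old.items():
        if key.endswith("USEC") and key[:-4] + "SEC" in old:
            continue  # already handled together with its SEC partner
        if key.endswith("SEC") and key[:-3] + "USEC" in old:
            new[key[:-3]] = merge_timestamp(value, old[key[:-3] + "USEC"])
        else:
            new[key] = value  # e.g., LAUNCH, which is already a float
    return " ".join("%s=%s" % item for item in new.items())


if __name__ == "__main__":
    meta = {"FILESIZE": "51200", "SOURCE": "ferrinii"}
    print(convert_data_point(
        "STARTSEC=1302523501 STARTUSEC=703442 LAUNCH=1302523261.18", meta))
```

Whether timestamps should be rounded (as here) or truncated to two decimal places is an open choice; for the example above both give START=1302523501.70.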
What do you think? Do these changes make sense? If so, here are the next steps:
- The first step in this endeavor is to wait for the results of #2687 (closed) where we try to implement an efficient .mergedata parser in R.
- The next step would be to change consolidate_stats.py to add the new meta data fields and combine SEC and USEC fields for us.
- As soon as we have the new .mergedata format, I'll update metrics-db to aggregate the various Torperf files and prepare them for the metrics website. I'll also update metrics-web to parse the .mergedata format instead of the .data format. And of course, I'll update the Overview of Statistical Data in the Tor Network to describe the new format.
- Once we start working on #2565 (closed), we might want to dump the .data and .extradata formats entirely and have Torperf only output the .mergedata format.
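To illustrate why self-describing data points make concatenated files easy to process, here is a small Python sketch that groups the lines of one concatenated file by source and file size without looking at any filename. This is an assumption-laden example, not metrics-db or metrics-web code: the function name is hypothetical, and it assumes the space-separated KEY=VALUE layout with SOURCE and FILESIZE fields as proposed above.

```python
from collections import defaultdict


def group_by_metadata(lines, keys=("SOURCE", "FILESIZE")):
    """Group data points by the values of the given meta data fields.

    Returns a dict that maps a tuple of meta data values to the list of
    parsed data points (dicts) carrying those values. A data point that
    lacks one of the fields gets None in that position.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between concatenated files
        # partition("=") yields (key, "=", value); [::2] keeps key and value
        fields = dict(token.partition("=")[::2] for token in line.split())
        groups[tuple(fields.get(k) for k in keys)].append(fields)
    return dict(groups)


if __name__ == "__main__":
    concatenated = [
        "FILESIZE=51200 SOURCE=ferrinii START=1302523501.70",
        "FILESIZE=51200 SOURCE=moria START=1302523502.10",
        "FILESIZE=51200 SOURCE=ferrinii START=1302523801.25",
    ]
    for group, points in group_by_metadata(concatenated).items():
        print(group, len(points))
```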