Opened 9 years ago

Closed 8 years ago

#3036 closed enhancement (implemented)

Tweak Torperf's .mergedata format and make it the new default

Reported by: karsten Owned by: karsten
Priority: Medium Milestone:
Component: Archived/Torperf Version:
Severity: Keywords:
Cc: mikeperry, Sebastian, rransom, arma Actual Points: 11
Parent ID: Points:
Reviewer: Sponsor:

Description

Right now, we have three Torperf data formats: the .data files containing the output of trivsocks-client.c, the .extradata files containing the output of the Python script attached to Tor's control port, and the .mergedata files containing the consolidation of the two formats.

I'd like to tweak the .mergedata format to make it easier to process, and I want to make it the new default Torperf output format.

Here's what I'd like to change:

  • Every data point in the new .mergedata format should contain the meta data that is necessary to generate Torperf graphs. This meta data contains the file size, the source (moria, siv, ferrinii, etc.), and possibly a custom guard choice and/or custom circuit build timeout. I could imagine adding these meta data as FILESIZE=51200, SOURCE=ferrinii, GUARDS=slowratio, CBT=75.

One motivation for this change is to remove the dependency on the filename, which is how we currently encode these meta data, e.g., slowratio75cbt-50kb.mergedata.

Also, I'd like to be able to concatenate multiple Torperf files and have a single file for a) the standard Torperf runs of a given month and b) the Torperf runs from a given experiment. This makes it easier for people to download and process our Torperf data.

  • We should combine the SEC and USEC fields and simply write timestamps as floats with a precision of, say, two decimal places, like we do in LAUNCH=1302523261.18. For example, STARTSEC=1302523501 STARTUSEC=703442 would become START=1302523501.70. This saves a lot of bytes and maybe even a few CPU cycles when parsing the individual fields of a data point.
  • When measuring hidden service performance as in #1944, we should add custom fields for the various hidden service substeps, e.g., START_RENDCIRC, GOT_INTROCIRC, etc.
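
The SEC/USEC merge proposed above can be sketched in a few lines of Python. This is only an illustration, not the code in consolidate_stats.py: the function names combine_timestamp and merge_sec_usec_fields are hypothetical, and the sketch assumes a data point has already been split into a dict of string fields.

```python
def combine_timestamp(sec: int, usec: int) -> str:
    """Merge a seconds/microseconds pair into a decimal string with two places."""
    # Integer arithmetic avoids binary floating-point rounding surprises.
    # Adding 5,000 microseconds rounds to the nearest hundredth of a second.
    centis = (sec * 1_000_000 + usec + 5_000) // 10_000
    return f"{centis // 100}.{centis % 100:02d}"


def merge_sec_usec_fields(fields: dict) -> dict:
    """Replace every <NAME>SEC/<NAME>USEC pair in `fields` with a single <NAME> entry."""
    merged = {}
    for key, value in fields.items():
        if key.endswith("USEC") and key[:-4] + "SEC" in fields:
            continue  # handled together with its SEC partner below
        if key.endswith("SEC") and key[:-3] + "USEC" in fields:
            usec = fields[key[:-3] + "USEC"]
            merged[key[:-3]] = combine_timestamp(int(value), int(usec))
        else:
            merged[key] = value  # fields without a SEC/USEC partner pass through
    return merged
```

With the values from the example above, combine_timestamp(1302523501, 703442) yields 1302523501.70. Whether to round or truncate the third decimal place is a design choice that the ticket leaves open; this sketch rounds.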

What do you think? Do these changes make sense? If so, here are the next steps:

  • The first step in this endeavor is to wait for the results of #2687 where we try to implement an efficient .mergedata parser in R.
  • The next step would be to change consolidate_stats.py to add the new meta data fields and combine SEC and USEC fields for us.
  • As soon as we have the new .mergedata format, I'll update metrics-db to aggregate the various Torperf files and prepare them for the metrics website. I'll also update metrics-web to parse the .mergedata format instead of the .data format. And of course, I'll update the Overview of Statistical Data in the Tor Network to describe the new format.
  • Once we start working on #2565, we might want to dump the .data and .extradata formats entirely and have Torperf only output the .mergedata format.

Attachments (2)

siv-5mb.tpf.gz (349.8 KB) - added by karsten 8 years ago.
siv's 5 MiB Torperf data in the new format
Convert.java (7.7 KB) - added by karsten 8 years ago.
Java class to convert Torperf's .data and .extradata files to the .tpf format


Change History (6)

comment:1 Changed 8 years ago by karsten

Cc: mikeperry Sebastian rransom arma added

I'm picking up this ticket again, because I learned a few days ago that we were not archiving Torperf data correctly. Looks like we lost 2 to 4 months of siv's data. Oops.

While looking into the archiving problem I decided to work on the new Torperf data format which will be a lot easier to archive than the current format. As a positive side effect, the new format will be much easier to understand for non-core Torperf developers. I'm planning to archive only the new format and not archive the current formats in the future. So, the new format should contain all relevant information.

I realize that the Torperf rewrite won't happen anytime soon, so I'm going to implement the new Torperf format in metrics-db. Torperf will still generate the old formats, but metrics-db will convert the output to the new format. Whenever the Torperf rewrite happens it can output the new format itself.

The suggested new format is pretty much as described in this ticket. The basic idea is that there is a single line per Torperf run which is sufficient to learn about 1) the Tor and Torperf configuration, 2) measurement results, and 3) additional information that might help explain the results.

  1. Configuration
  • SOURCE: Configured name of the data source; required.
  • FILESIZE: Configured file size in bytes; required.
  • Other meta data describing the Tor or Torperf configuration, e.g., GUARD for a custom guard choice; optional.
  2. Measurement results
  • START: Time when the connection process starts; required.
  • SOCKET: Time when the socket was created; required.
  • CONNECT: Time when the socket was connected; required.
  • NEGOTIATE: Time when SOCKS 5 authentication methods have been negotiated; required.
  • REQUEST: Time when the SOCKS request was sent; required.
  • RESPONSE: Time when the SOCKS response was received; required.
  • DATAREQUEST: Time when the HTTP request was written; required.
  • DATARESPONSE: Time when the first response was received; required.
  • DATACOMPLETE: Time when the payload was complete; required.
  • WRITEBYTES: Total number of bytes written; required.
  • READBYTES: Total number of bytes read; required.
  • DIDTIMEOUT: 1 if the request timed out, 0 otherwise; optional.
  • Other measurement results, e.g., START_RENDCIRC, GOT_INTROCIRC, etc. for hidden-service measurements.
  3. Additional information
  • LAUNCH: Time when the circuit was launched; optional.
  • USED_AT: Time when this circuit was used; optional.
  • PATH: List of relays in the circuit, separated by commas; optional.
  • BUILDTIMES: List of times when circuit hops were built, separated by commas; optional.
  • TIMEOUT: Circuit build timeout that the Tor client used when building this circuit; optional.
  • QUANTILE: Circuit build time quantile that the Tor client uses to determine its circuit-build timeout; optional.
  • CIRC_ID: Circuit identifier of the circuit used for this measurement; optional.
  • USED_BY: Stream identifier of the stream used for this measurement; optional.
  • Other fields containing additional information; optional.
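
To make the field list above concrete, here is a minimal Python sketch of a reader for one line in the new format. It assumes that a line consists of space-separated KEY=VALUE pairs, which matches the FILESIZE=51200 style used in this ticket but is not spelled out here. The names parse_run, REQUIRED_FIELDS, and LIST_FIELDS are hypothetical and not part of metrics-db or metrics-lib.

```python
# Fields marked "required" in the list above.
REQUIRED_FIELDS = {
    "SOURCE", "FILESIZE", "START", "SOCKET", "CONNECT", "NEGOTIATE",
    "REQUEST", "RESPONSE", "DATAREQUEST", "DATARESPONSE", "DATACOMPLETE",
    "WRITEBYTES", "READBYTES",
}
# Fields described as comma-separated lists.
LIST_FIELDS = {"PATH", "BUILDTIMES"}


def parse_run(line: str) -> dict:
    """Parse one Torperf run (one line) into a dict, checking required fields."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE pair: {token!r}")
        # Keep values as strings; unknown optional fields are accepted as-is.
        fields[key] = value.split(",") if key in LIST_FIELDS else value
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return fields
```

Because unknown keys are kept rather than rejected, lines with extra fields such as START_RENDCIRC or GUARD would still parse, which fits the "other fields; optional" entries in each group.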

Note that two pieces of information from the current .extradata files are not included in the new Torperf data format:

  • Build timeout details: The current .extradata files contain the full BUILDTIMEOUT_SET events that were sent by Tor via its control port. They are not part of the new format, because they mostly explain why Tor picked a given circuit build timeout, where the timeout itself is already part of the new format. In theory, it would be possible to include some details of the last BUILDTIMEOUT_SET event that was received before a Torperf run was finished and written to the .extradata file.
  • Unused circuits: The .extradata files also contain information about circuits that were not used by Torperf. There's hardly any relation to the Torperf measurements, so they're left out. In theory, one could include aggregate information about the number of failed circuits before a Torperf run was finished and written to the .extradata file.

I understand that people may find the information that was left out here important. I could also imagine that people find other information important. We can't put all data that was generated while performing Torperf measurements in this format. We'd end up adding Tor's debug logs to the format. We should identify relevant information that is sufficient for most analyses. For example, I can be convinced to add single fields or aggregated data from the build timeout events or unused circuits. But if someone wants to analyze a specific aspect of Tor's performance, they'll need to keep Tor's logs or controller events in addition to the new Torperf data format.

Please find siv's 5 MiB Torperf data in the new format attached to this ticket as an example.

Changed 8 years ago by karsten

Attachment: siv-5mb.tpf.gz added

siv's 5 MiB Torperf data in the new format

Changed 8 years ago by karsten

Attachment: Convert.java added

Java class to convert Torperf's .data and .extradata files to the .tpf format

comment:2 Changed 8 years ago by karsten

The archiving issues are now solved in metrics-db with Torperf's old .data and .extradata formats. We should switch to the new format anyway.

I attached the Java class that I used to produce siv's 5 MiB Torperf data file in the new format (which is also attached). This class assumes that .data and .extradata files exist in the local working directory. It produces a single .tpf file for every pair of .data and .extradata files.

Next steps are:

  • Fix the TODOs in Convert.java and review that class in detail to be sure that all relevant parts of .data and .extradata files are contained in .tpf files.
  • Implement a mechanism to keep state between executions so that we read only the lines newly appended to .data and .extradata files. Watch out for edge cases, e.g., when we cannot merge a line in the current execution but maybe in the next execution.
  • Merge results into daily .tpf files instead of a huge single .tpf file. Don't require old output files to be around forever, because we'll want to archive them at least monthly.
  • Integrate the updated Convert.java into metrics-db's TorperfDownloader.java which downloads possibly truncated Torperf .data and .extradata files. That class should then output daily .tpf files.
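
One possible shape for the state-keeping step is sketched below in Python. It is not how Convert.java or TorperfDownloader.java work; it only illustrates the idea of remembering a byte offset per input file. The function name read_new_lines and the JSON state file are assumptions made for this example.

```python
import json
import os


def read_new_lines(data_path: str, state_path: str) -> list:
    """Return the complete lines appended to data_path since the previous call."""
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as state_file:
            state = json.load(state_file)
    offset = state.get(data_path, 0)
    if os.path.getsize(data_path) < offset:
        offset = 0  # file was truncated or replaced, so start over

    with open(data_path, "rb") as data_file:
        data_file.seek(offset)
        chunk = data_file.read()

    # Consume only up to the last newline; a partially written final line
    # stays unread until a later execution sees it completed.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()

    state[data_path] = offset + end
    with open(state_path, "w") as state_file:
        json.dump(state, state_file)
    return lines
```

This covers only the "newly appended lines" part. The edge case mentioned above, a line that cannot be merged yet but may be mergeable in the next execution, would need those pending lines stored in the state as well.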

comment:3 Changed 8 years ago by karsten

The "next steps" from my previous comment are now all done. New tarballs are available using the .tpf file format that is specified here. metrics-lib supports parsing the .tpf format. The .data and .extradata formats are still available, though they might go away in a few months.

What's left to do is provide the most recent .tpf files via rsync. That'll have to wait until the oldest .tpf files are 3+ days old, or we'll consider everything "recent." Will do this next week. Leaving the ticket open until that's done, too.

Another thing that's left to do, but that may never happen, is having Torperf output the .tpf format itself rather than having metrics-db produce it. We'll have to find a Torperf developer for that, though.

comment:4 in reply to:  3 Changed 8 years ago by karsten

Actual Points: 11
Resolution: implemented
Status: new → closed

Replying to karsten:

What's left to do is provide the most recent .tpf files via rsync. That'll have to wait until the oldest .tpf files are 3+ days old, or we'll consider everything "recent." Will do this next week. Leaving the ticket open until that's done, too.

Done. Closing.
