Make Torperf log circuit failures

added component::archived/torperf priority::medium resolution::wontfix status::closed type::enhancement labels

Hrmm.. We're likely to want to observe this data for some time, perhaps on the order of a year, at least. Looking at https://metrics.torproject.org/network.html#versions, it seems to take around 6 months or more for relays to migrate off of the previous stable version. If we assume client churn is similar, we're going to need to watch that data for at least that long (as we transition to ntor with 0.2.4.x).

Since reliability is tied closely to network security, I think we probably want to keep an eye on this data basically forever, just in case we hit some other load barrier that suddenly introduces lots of failure (like socket or TCP sourceport exhaustion).

This makes me think that if we do go the separate file route, we want to publish it, too... (It could be shorter than full CIRC event logs.. We only want the failures vs successes right now).

However, one option for per-line data is only listing the top 2-3 reason pairs so far, and then also list the remaining total under OTHER?

Hmm, okay, let's give up the one-output-line-per-meausurement format. If we need this data for more than a few weeks, we should publish it. And while I see how we could aggregate circuit failures in a single new field for it, this isn't going to scale forever. I'm certain you'll want other circuit information in the future, and then we'd have this discussion again. So, let's screw the format we have and come up with something new. No better time than now to do this.

Here's a sample result line from a currently running 0.2.5.x Torperf (reformatted and annotated):

# Configuration
SOURCE=ec2
FILESIZE=51200

# Measurement results
START=1365473921.75
SOCKET=1365473921.75
CONNECT=1365473921.75
NEGOTIATE=1365473921.75
REQUEST=1365473921.75
RESPONSE=1365473922.07
DATAREQUEST=1365473922.07
DATARESPONSE=1365473922.40
DATACOMPLETE=1365473922.76
WRITEBYTES=82
READBYTES=51323
DIDTIMEOUT=0
DATAPERC10=1365473922.56
DATAPERC20=1365473922.64
DATAPERC30=1365473922.64
DATAPERC40=1365473922.68
DATAPERC50=1365473922.68
DATAPERC60=1365473922.72
DATAPERC70=1365473922.72
DATAPERC80=1365473922.75
DATAPERC90=1365473922.75

# Additional information
LAUNCH=1365473682.13
USED_AT=1365473922.76
PATH=$62680CF0743460E5F836F949E37A6DEC22622F9E,
     $11064D066F892DC38AAEFDA5EDAE1A227D07D182,
     $35F51DCE73B988CBAE06B06312B4D1271979FE3B
BUILDTIMES=0.53,1.10,1.32
TIMEOUT=4833
QUANTILE=0.800000
CIRC_ID=501
USED_BY=884

I could imagine a JSON format with three event types for file_download, circuit, and buildtimeout_set:

[
  {
    "type": "buildtimeout_set",
    "id": "1365473682.03",
    "timeout": 4833,
    "quantile": 0.800000
  },
  {
    "type": "circuit",
    "id": "1365473682.13_501",
    "launch": 1365473682.13,
    "used_at": 1365473922.76,
    "path": [ "$62680CF0743460E5F836F949E37A6DEC22622F9E",
              "$11064D066F892DC38AAEFDA5EDAE1A227D07D182",
              "$35F51DCE73B988CBAE06B06312B4D1271979FE3B" ],
    "buildtimes": [ 0.53, 1.10, 1.32 ],
    "circ_id": 501,
    "used_by": 884
  },
  {
    "type": "file_download",
    "id": "ec2_51200_1365473921.75",
    "source": "ec2",
    "filesize": "51200",
    "start": 1365473921.75,
    "socket": 1365473921.75,
    "connect": 1365473921.75,
    "negotiate": 1365473921.75,
    "request": 1365473921.75,
    "response": 1365473922.07,
    "datarequest": 1365473922.07,
    "dataresponse": 1365473922.40,
    "datacomplete": 1365473922.76,
    "writebytes": 82,
    "readbytes": 51323,
    "didtimeout": 0,
    "dataperc": { "10": 1365473922.56,
                  "20": 1365473922.64,
                  "30": 1365473922.64,
                  "40": 1365473922.68,
                  "50": 1365473922.68,
                  "60": 1365473922.72,
                  "70": 1365473922.72,
                  "80": 1365473922.75,
                  "90": 1365473922.75 },
    "circ_id": "1365473682.13_501",
    "buildtimeout_used": "1365473682.03"
  }
]

This is just a draft that may change during implementation. We could also add a separate event type for more detailed information about the stream used for the measurement.

We can easily add reasons for circuit failures, which would mostly be in "circuit" events that are not referenced from "file_download" events.

Anything else that comes to mind that we should add?

mikeperry, what do you think about my suggestion above?

Trac:
Status: new to needs_information

JSON sounds like a good choice here for maximum extensibility and machine-readability.

Later down the road, we may want some info on stream reliability and timeout data, too. JSON will make it easier to add those, too.

However, one area where you'll need to take care is how to handle grouping multiple failures. Based on the above example, it looks like you intend all circuits related to a fetch to be encoded in the same top-level JSON object for the fetch? Or do you intend to have separate objects for each failure?

Related: How does Torperf currently handle streams that detach from a circuit and get retried on another circuit? This can happen for example if you get an EXITPOLICY or HOSTUNREACH relay reason back before establishment, or if you hit the stream timeout. Probably some other cases too.

Replying to mikeperry:

JSON sounds like a good choice here for maximum extensibility and machine-readability.

Great.

Later down the road, we may want some info on stream reliability and timeout data, too. JSON will make it easier to add those, too.

What info on stream reliability and timeout data would that be? The better we know what data we plan to gather, the better we can design the file format.

However, one area where you'll need to take care is how to handle grouping multiple failures. Based on the above example, it looks like you intend all circuits related to a fetch to be encoded in the same top-level JSON object for the fetch? Or do you intend to have separate objects for each failure?

My original idea was to have separate objects for a) each file download including stream details, b) circuit (summary of CIRC events with same CircuitID), and c) BUILDTIMEOUT_SET event. These objects would all be on the top level and contain references to each other. In the example above, there's a field "circ_id": "1365473682.13_501" in the "file_download" event that refers to the "circuit" event.

But you're right, it's quite possible that we use more than one circuit in a single fetch. So far, whenever we detached a stream from a circuit and re-attached it to another circuit, we simply discarded the old circuit information. We shouldn't do that. Instead, we should keep a list of circuits to which a stream was attached, together with REASON and REMOTE_REASON information.

We'll want to have a separate event type for each stream that contains a summary of STREAM events with the same StreamID. This wasn't necessary so far, because we didn't really store anything about the stream we used for the measurement, but that is different now.

Here's an updated example. There could be several "circuit" events in this list, and the "stream" event could contain more than just one of them in its "circuits" field.

[
  {
    "type": "buildtimeout_set",
    "id": "1365473682.03",
    "timeout": 4833,
    "quantile": 0.800000
  },
  {
    "type": "circuit",
    "id": "1365473682.13_501",
    "launch": 1365473682.13,
    "path": [ "$62680CF0743460E5F836F949E37A6DEC22622F9E",
              "$11064D066F892DC38AAEFDA5EDAE1A227D07D182",
              "$35F51DCE73B988CBAE06B06312B4D1271979FE3B" ],
    "buildtimes": [ 0.53, 1.10, 1.32 ],
    "circ_id": 501,
    "buildtimeout_used": "1365473682.03"
  },
  { "type": "stream",
    "id": "1365473922.76_884",
    "circuits": [ { "circ_id": "1365473682.13_501",
                    "reason": "END",
                    "remote_reason": "DONE" } ]
  },
  {
    "type": "file_download",
    "id": "ec2_51200_1365473921.75",
    "source": "ec2",
    "filesize": "51200",
    "start": 1365473921.75,
    "socket": 1365473921.75,
    "connect": 1365473921.75,
    "negotiate": 1365473921.75,
    "request": 1365473921.75,
    "response": 1365473922.07,
    "datarequest": 1365473922.07,
    "dataresponse": 1365473922.40,
    "datacomplete": 1365473922.76,
    "writebytes": 82,
    "readbytes": 51323,
    "didtimeout": 0,
    "dataperc": { "10": 1365473922.56,
                  "20": 1365473922.64,
                  "30": 1365473922.64,
                  "40": 1365473922.68,
                  "50": 1365473922.68,
                  "60": 1365473922.72,
                  "70": 1365473922.72,
                  "80": 1365473922.75,
                  "90": 1365473922.75 },
    "stream": "1365473922.76_884"
  }
]

Related: How does Torperf currently handle streams that detach from a circuit and get retried on another circuit? This can happen for example if you get an EXITPOLICY or HOSTUNREACH relay reason back before establishment, or if you hit the stream timeout. Probably some other cases too.

(This should be handled by the new data format.)

Thanks for your feedback so far! This is very helpful.

We officially switched from Torperf to OnionPerf for Tor performance measurements. We don't need these tickets anymore and will use OnionPerf's issue tracker for OnionPerf-related issues. Closing them all as wontfix.

Trac:
Resolution: N/A to wontfix
Status: needs_information to closed

closed

Make Torperf log circuit failures

Child items ...

Activity