Opened 7 years ago

Closed 2 years ago

#8662 closed enhancement (wontfix)

Make Torperf log circuit failures

Reported by: karsten Owned by:
Priority: Medium Milestone:
Component: Archived/Torperf Version:
Severity: Keywords:
Cc: mikeperry, robgjansen Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Mike, Rob, and I briefly discussed at the dev meeting that we could make Torperf log circuit failures by adding REASON and REMOTE_REASON from CIRC events to Torperf's output whenever a request fails.

I can see how Torperf learns about circuit failures, but I'm yet unsure how to include this information in its output. Torperf outputs exactly one line per measurement. But if a circuit fails before attaching Torperf's measurement stream to it, we'll never learn about that circuit in the context of this measurement. And I assume this isn't just about circuits failing after they got a stream attached, right?

Here are a few options:

  • We don't include circuit failure information in Torperf's results, but log CIRC events locally to a separate file. These CIRC event log files won't be archived or made publicly available. This may work if we're only looking for changes in CIRC failures within the next few weeks or so.
  • We add a new field to Torperf's output collecting all CIRC failures since the last measurement. This is somewhat similar to the QUANTILE and TIMEOUT fields that contain values from the last received BUILDTIMEOUT_SET event.
  • We give up the one-output-line-per-measurement format and add further information to Torperf's output. This is a rather big data format change though. If we were to do this, we should also move QUANTILE and TIMEOUT to their own lines.

Thoughts?

Child Tickets

Change History (6)

comment:1 Changed 7 years ago by mikeperry

Hrmm.. We're likely to want to observe this data for some time, perhaps on the order of a year, at least. Looking at https://metrics.torproject.org/network.html#versions, it seems to take around 6 months or more for relays to migrate off of the previous stable version. If we assume client churn is similar, we're going to need to watch that data for at least that long (as we transition to ntor with 0.2.4.x).

Since reliability is tied closely to network security, I think we probably want to keep an eye on this data basically forever, just in case we hit some other load barrier that suddenly introduces lots of failure (like socket or TCP sourceport exhaustion).

This makes me think that if we do go the separate file route, we want to publish it, too... (It could be shorter than full CIRC event logs.. We only want the failures vs successes right now).

However, one option for per-line data is only listing the top 2-3 reason pairs so far, and then also list the remaining total under OTHER?

comment:2 Changed 7 years ago by karsten

Hmm, okay, let's give up the one-output-line-per-meausurement format. If we need this data for more than a few weeks, we should publish it. And while I see how we could aggregate circuit failures in a single new field for it, this isn't going to scale forever. I'm certain you'll want other circuit information in the future, and then we'd have this discussion again. So, let's screw the format we have and come up with something new. No better time than now to do this.

Here's a sample result line from a currently running 0.2.5.x Torperf (reformatted and annotated):

# Configuration
SOURCE=ec2
FILESIZE=51200

# Measurement results
START=1365473921.75
SOCKET=1365473921.75
CONNECT=1365473921.75
NEGOTIATE=1365473921.75
REQUEST=1365473921.75
RESPONSE=1365473922.07
DATAREQUEST=1365473922.07
DATARESPONSE=1365473922.40
DATACOMPLETE=1365473922.76
WRITEBYTES=82
READBYTES=51323
DIDTIMEOUT=0
DATAPERC10=1365473922.56
DATAPERC20=1365473922.64
DATAPERC30=1365473922.64
DATAPERC40=1365473922.68
DATAPERC50=1365473922.68
DATAPERC60=1365473922.72
DATAPERC70=1365473922.72
DATAPERC80=1365473922.75
DATAPERC90=1365473922.75

# Additional information
LAUNCH=1365473682.13
USED_AT=1365473922.76
PATH=$62680CF0743460E5F836F949E37A6DEC22622F9E,
     $11064D066F892DC38AAEFDA5EDAE1A227D07D182,
     $35F51DCE73B988CBAE06B06312B4D1271979FE3B
BUILDTIMES=0.53,1.10,1.32
TIMEOUT=4833
QUANTILE=0.800000
CIRC_ID=501
USED_BY=884

I could imagine a JSON format with three event types for file_download, circuit, and buildtimeout_set:

[
  {
    "type": "buildtimeout_set",
    "id": "1365473682.03",
    "timeout": 4833,
    "quantile": 0.800000
  },
  {
    "type": "circuit",
    "id": "1365473682.13_501",
    "launch": 1365473682.13,
    "used_at": 1365473922.76,
    "path": [ "$62680CF0743460E5F836F949E37A6DEC22622F9E",
              "$11064D066F892DC38AAEFDA5EDAE1A227D07D182",
              "$35F51DCE73B988CBAE06B06312B4D1271979FE3B" ],
    "buildtimes": [ 0.53, 1.10, 1.32 ],
    "circ_id": 501,
    "used_by": 884
  },
  {
    "type": "file_download",
    "id": "ec2_51200_1365473921.75",
    "source": "ec2",
    "filesize": "51200",
    "start": 1365473921.75,
    "socket": 1365473921.75,
    "connect": 1365473921.75,
    "negotiate": 1365473921.75,
    "request": 1365473921.75,
    "response": 1365473922.07,
    "datarequest": 1365473922.07,
    "dataresponse": 1365473922.40,
    "datacomplete": 1365473922.76,
    "writebytes": 82,
    "readbytes": 51323,
    "didtimeout": 0,
    "dataperc": { "10": 1365473922.56,
                  "20": 1365473922.64,
                  "30": 1365473922.64,
                  "40": 1365473922.68,
                  "50": 1365473922.68,
                  "60": 1365473922.72,
                  "70": 1365473922.72,
                  "80": 1365473922.75,
                  "90": 1365473922.75 },
    "circ_id": "1365473682.13_501",
    "buildtimeout_used": "1365473682.03"
  }
]

This is just a draft that may change during implementation. We could also add a separate event type for more detailed information about the stream used for the measurement.

We can easily add reasons for circuit failures, which would mostly be in "circuit" events that are not referenced from "file_download" events.

Anything else that comes to mind that we should add?

comment:3 Changed 6 years ago by karsten

Status: newneeds_information

mikeperry, what do you think about my suggestion above?

comment:4 Changed 6 years ago by mikeperry

JSON sounds like a good choice here for maximum extensibility and machine-readability.

Later down the road, we may want some info on stream reliability and timeout data, too. JSON will make it easier to add those, too.

However, one area where you'll need to take care is how to handle grouping multiple failures. Based on the above example, it looks like you intend all circuits related to a fetch to be encoded in the same top-level JSON object for the fetch? Or do you intend to have separate objects for each failure?

Related: How does Torperf currently handle streams that detach from a circuit and get retried on another circuit? This can happen for example if you get an EXITPOLICY or HOSTUNREACH relay reason back before establishment, or if you hit the stream timeout. Probably some other cases too.

comment:5 in reply to:  4 Changed 6 years ago by karsten

Replying to mikeperry:

JSON sounds like a good choice here for maximum extensibility and machine-readability.

Great.

Later down the road, we may want some info on stream reliability and timeout data, too. JSON will make it easier to add those, too.

What info on stream reliability and timeout data would that be? The better we know what data we plan to gather, the better we can design the file format.

However, one area where you'll need to take care is how to handle grouping multiple failures. Based on the above example, it looks like you intend all circuits related to a fetch to be encoded in the same top-level JSON object for the fetch? Or do you intend to have separate objects for each failure?

My original idea was to have separate objects for a) each file download including stream details, b) circuit (summary of CIRC events with same CircuitID), and c) BUILDTIMEOUT_SET event. These objects would all be on the top level and contain references to each other. In the example above, there's a field "circ_id": "1365473682.13_501" in the "file_download" event that refers to the "circuit" event.

But you're right, it's quite possible that we use more than one circuit in a single fetch. So far, whenever we detached a stream from a circuit and re-attached it to another circuit, we simply discarded the old circuit information. We shouldn't do that. Instead, we should keep a list of circuits to which a stream was attached, together with REASON and REMOTE_REASON information.

We'll want to have a separate event type for each stream that contains a summary of STREAM events with the same StreamID. This wasn't necessary so far, because we didn't really store anything about the stream we used for the measurement, but that is different now.

Here's an updated example. There could be several "circuit" events in this list, and the "stream" event could contain more than just one of them in its "circuits" field.

[
  {
    "type": "buildtimeout_set",
    "id": "1365473682.03",
    "timeout": 4833,
    "quantile": 0.800000
  },
  {
    "type": "circuit",
    "id": "1365473682.13_501",
    "launch": 1365473682.13,
    "path": [ "$62680CF0743460E5F836F949E37A6DEC22622F9E",
              "$11064D066F892DC38AAEFDA5EDAE1A227D07D182",
              "$35F51DCE73B988CBAE06B06312B4D1271979FE3B" ],
    "buildtimes": [ 0.53, 1.10, 1.32 ],
    "circ_id": 501,
    "buildtimeout_used": "1365473682.03"
  },
  { "type": "stream",
    "id": "1365473922.76_884",
    "circuits": [ { "circ_id": "1365473682.13_501",
                    "reason": "END",
                    "remote_reason": "DONE" } ]
  },
  {
    "type": "file_download",
    "id": "ec2_51200_1365473921.75",
    "source": "ec2",
    "filesize": "51200",
    "start": 1365473921.75,
    "socket": 1365473921.75,
    "connect": 1365473921.75,
    "negotiate": 1365473921.75,
    "request": 1365473921.75,
    "response": 1365473922.07,
    "datarequest": 1365473922.07,
    "dataresponse": 1365473922.40,
    "datacomplete": 1365473922.76,
    "writebytes": 82,
    "readbytes": 51323,
    "didtimeout": 0,
    "dataperc": { "10": 1365473922.56,
                  "20": 1365473922.64,
                  "30": 1365473922.64,
                  "40": 1365473922.68,
                  "50": 1365473922.68,
                  "60": 1365473922.72,
                  "70": 1365473922.72,
                  "80": 1365473922.75,
                  "90": 1365473922.75 },
    "stream": "1365473922.76_884"
  }
]

Related: How does Torperf currently handle streams that detach from a circuit and get retried on another circuit? This can happen for example if you get an EXITPOLICY or HOSTUNREACH relay reason back before establishment, or if you hit the stream timeout. Probably some other cases too.

(This should be handled by the new data format.)

Thanks for your feedback so far! This is very helpful.

comment:6 Changed 2 years ago by karsten

Resolution: wontfix
Status: needs_informationclosed

We officially switched from Torperf to OnionPerf for Tor performance measurements. We don't need these tickets anymore and will use OnionPerf's issue tracker for OnionPerf-related issues. Closing them all as wontfix.

Note: See TracTickets for help on using tickets.