Metrics Definitions

The following metrics are meant for use in performance and scalability tuning, development, and research. We are attempting to capture a representative baseline and a consistent data visualization methodology, and to ensure that we are aware of which metrics require new data collection to produce.

Our metrics are broken down into five categories: latency, throughput, reliability, capacity, and balancing.

Latency Metrics

  • CDF-TTFB: Cumulative distribution function of the time-to-first-byte of a 5MB download. See page 2 of this report.
    • A good CDF-TTFB should look like a cliff (very little performance variance in times) and this cliff should be close to the origin of the graph (very fast response times overall).
    • A bad CDF-TTFB will look like a long, slow climb (high variance in performance and lots of slow results), and be very far from the origin of the graph (slow overall/average case performance).
  • CDF-RTT: This is the CDF of round trip times to an HTTP Request/response echo server.
    • XXX: Aggregating this metric and graphing it over time seems challenging, especially since we want to capture how individual circuits change over time.
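As a minimal sketch of how the CDFs above could be derived from raw measurements, the following builds an empirical CDF from a list of time-to-first-byte samples. The sample values and the function names are hypothetical and only illustrate the calculation; they are not taken from Torperf.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return (value, cumulative fraction) pairs, sorted by value."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


def fraction_at_or_below(samples, threshold):
    """Fraction of samples that are <= threshold (one point on the CDF)."""
    xs = sorted(samples)
    return bisect_right(xs, threshold) / len(xs)


# Hypothetical TTFB samples in seconds; one slow outlier stretches the tail.
ttfb_seconds = [0.42, 0.38, 0.51, 0.40, 2.90, 0.44, 0.39, 0.47]
cdf = empirical_cdf(ttfb_seconds)
```

A cliff-like CDF shows up here as most of the cumulative fraction being reached within a narrow range of values close to zero; the outlier at 2.90 seconds is what a long tail looks like in the raw pairs.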

Throughput Metrics

  • CDF-TTLB: Cumulative distribution function of the time-to-last-byte of a 5MB download. See page 3 of this report.
    • Good and bad results for this CDF have the same characteristics as the CDF-TTFB graph, but this graph shows us the performance of the entire download overall.
  • CDF-DL: This is the CDF of the average bandwidth of the second half of a 5MB download, similar to page 4 of this report.
    • Good and bad results for this CDF have the same characteristics as the CDF-TTFB and CDF-TTLB graphs, but this graph shows us the distribution of the steady-state throughput of the network for very long downloads.
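The per-download value behind CDF-DL could be computed along the following lines. This is a sketch that assumes each download is recorded as a list of (elapsed seconds, cumulative bytes) progress points; that record layout and the function name are assumptions for illustration, not the actual Torperf output format.

```python
def second_half_bandwidth(progress):
    """Average bytes/second over the second half of a download.

    progress: list of (elapsed_seconds, cumulative_bytes), in increasing order.
    The final entry doubles as the time-to-last-byte measurement.
    """
    total_time, total_bytes = progress[-1]
    half = total_bytes / 2
    # First progress point at or beyond the halfway mark.
    t_half, b_half = next((t, b) for t, b in progress if b >= half)
    if total_time <= t_half:
        raise ValueError("not enough progress points after the halfway mark")
    return (total_bytes - b_half) / (total_time - t_half)


# Hypothetical 5MB download: slow start, then steadier throughput.
progress = [
    (0.5, 0),
    (1.0, 1_000_000),
    (2.0, 2_500_000),
    (3.0, 4_000_000),
    (4.0, 5_000_000),
]
ttlb = progress[-1][0]
bandwidth = second_half_bandwidth(progress)
```

Measuring only the second half leaves out connection setup and ramp-up, which is why this value is read as a steady-state throughput.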

Reliability Metrics

  • Failure rainbow: The rate of stream timeouts and other connection failures, similar to page 1 of this report. XXX: Circuit timeouts and circuit failures should appear here somehow. Karsten also mentioned new failure types.
    • A good failure rainbow (ie: one that indicates healthy network performance) has a low number of stream timeouts and no user-facing failures, and no failures during download. It should look more like a single color, or largely dominated by a single color, and not like an actual rainbow.
    • A bad failure rainbow looks more like a smeared out actual rainbow. It has lots of failure counts for lots of different colors. The onion service rainbow from that report indicates that onion services are less healthy performance-wise than the public server. To emphasize that Failure Rainbows are bad, only vomit-related color tones should be used.
  • Circuit timeout rate: The frequency of circuit build timeouts, observed through the BUILDTIMEOUT_SET control port event or by manual counting.
    • The circuit timeout rate should consistently match the rate implied by the cbtquantile consensus parameter (XXX: This could be combined with the Failure Rainbow metric).
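A rough sketch of the counting behind the reliability metrics: tally each measurement outcome and report every failure type as a share of all attempts. The outcome labels are hypothetical. The helper for the expected timeout rate rests on an assumption that should be checked against the Tor specifications: that cbtquantile is the percentile of circuit build times used to set the timeout, so roughly (100 - cbtquantile) percent of circuits would be expected to time out.

```python
from collections import Counter


def failure_rates(outcomes):
    """Map each non-success outcome label to its share of all attempts."""
    total = len(outcomes)
    counts = Counter(outcomes)
    return {kind: n / total for kind, n in counts.items() if kind != "success"}


def expected_timeout_rate(cbtquantile):
    """Timeout rate implied by cbtquantile, under the assumption stated above."""
    return (100 - cbtquantile) / 100


# Hypothetical outcome labels for 20 measurement attempts.
outcomes = ["success"] * 16 + ["stream_timeout"] * 3 + ["connection_refused"]
rates = failure_rates(outcomes)
```

Each key in `rates` would correspond to one color band of the failure rainbow; a healthy result has few keys and small values.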

Capacity Metrics

The following metrics come from relay extrainfo descriptors. Because relays choose different time intervals for the values in these metrics, we must use much larger on/off time windows for experiments that need these metrics (irl suggests 72 hour cycles, using only the middle 24 hours for results):

  • Utilization: This metric measures the ratio between current average bandwidth read/write history values and peak observed bandwidth (aka peak "advertised bandwidth" over time). Technically we have one of these metrics for each node type (Exit, Guard+Exit, Guard, Middle). Note that raw "advertised bandwidth" is not an accurate reflection of peak capacity of a node -- we want to extract the highest advertised bandwidth value over longer periods of time (eg 1 month) for each node to get a better reflection of peak capacity for use in deriving this metric.
    • A healthy network has a large difference between its peak possible throughput and the average load it sustains. This means utilization is low, and it has plenty of room for new flows to be added without congestion or contention. (This metric has been improving at a very high rate ever since Snowden, so yay).
    • An unhealthy network operates with an average load that is very close to its peak possible throughput. This means most of its streams are in a congested state -- latency will build up and other performance/health metrics should show signs of stress.
  • Bottleneck Utilization: Compute the Guard, Exit, and Total Utilization levels at each time point, and choose the highest utilization value of the three.
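The two capacity metrics could be sketched as follows for a single time point. The record layout is hypothetical, and two choices here are assumptions rather than anything the definitions above fix: utilization is aggregated by summing over relays, and any relay carrying the Guard (or Exit) flag counts toward the Guard (or Exit) class.

```python
def utilization(relays):
    """Sum of average load divided by sum of per-relay peak advertised bandwidth."""
    load = sum(r["avg_bw"] for r in relays)
    # Peak capacity: highest advertised bandwidth seen over a long window.
    peak = sum(max(r["advertised_bw_history"]) for r in relays)
    if peak == 0:
        raise ValueError("no capacity recorded for this relay class")
    return load / peak


def bottleneck_utilization(relays):
    """Highest of the Guard, Exit, and Total utilization levels."""
    guards = [r for r in relays if "Guard" in r["flags"]]
    exits = [r for r in relays if "Exit" in r["flags"]]
    return max(utilization(guards), utilization(exits), utilization(relays))


# Hypothetical relays; bandwidth values share one arbitrary unit.
relays = [
    {"flags": {"Guard"}, "avg_bw": 30, "advertised_bw_history": [80, 100, 90]},
    {"flags": {"Exit"}, "avg_bw": 60, "advertised_bw_history": [70, 80]},
    {"flags": set(), "avg_bw": 10, "advertised_bw_history": [100, 50]},
]
total_utilization = utilization(relays)
bottleneck = bottleneck_utilization(relays)
```

In this example the Exit class is the bottleneck at 0.75, even though the network as a whole is utilized at well under half of its peak.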

Balancing Metrics

  • CDF-Relay-Utilization: Similar to the Per-flag Utilization, it is also possible to derive a CDF of the distribution of the average read/write history divided by peak advertised bandwidth, for each relay in the network. This metric would show us what the distribution of utilization is across the network. It can also be broken down per-flag (so that there are separate CDFs generated for Guard, Middle, Exit, and Guard+Exit flagged relays).
    • A healthy network will be well load-balanced, with all relays operating with similar amounts of reserve capacity in proportion to their total. Thus, this CDF should be narrow and cliff-like, and the cliff should sit at the overall network Utilization value (each relay is loaded the same as the overall network).
  • CDF-Relay-Stream-Capacity: To examine the load balancing of relays in terms of the capacity metrics that Torflow and sbws use, for each relay, using either the consensus or the individual votes, compute relay_balance_ratio = relay_measured_bw / relay_observed_bw. This balance ratio is equivalent to relay_stream_bw / network_avg_stream_bw. Like other Per-Relay CDFs, this can be broken down by relay flags, as well (Guard, Middle, Exit, Guard+Exit).
    • A healthy, balanced network will have a cliff in this CDF around 1.0. This means that all relays have the same stream bandwidth when carrying streams.
    • An unhealthy, unbalanced network will have a long, slow sloping hill, and/or lots of lumps below 1.0 and far above 1.0.
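A small sketch of the balance ratio calculation: compute relay_measured_bw / relay_observed_bw for each relay, sort the ratios to obtain the CDF's x-values, and check how much of the network sits near 1.0. The record layout, the example values, and the 10% tolerance are hypothetical, and the two bandwidth fields are assumed to already be in comparable units.

```python
def balance_ratios(relays):
    """Sorted relay_measured_bw / relay_observed_bw for relays with observed bw."""
    return sorted(
        r["measured_bw"] / r["observed_bw"] for r in relays if r["observed_bw"] > 0
    )


def fraction_near_one(ratios, tolerance=0.1):
    """Share of relays whose balance ratio is within tolerance of 1.0."""
    return sum(1 for x in ratios if abs(x - 1.0) <= tolerance) / len(ratios)


# Hypothetical relays: two balanced, one underweighted, one overweighted.
relays = [
    {"measured_bw": 100, "observed_bw": 100},
    {"measured_bw": 50, "observed_bw": 100},
    {"measured_bw": 300, "observed_bw": 100},
    {"measured_bw": 105, "observed_bw": 100},
]
ratios = balance_ratios(relays)
near_one = fraction_near_one(ratios)
```

A cliff at 1.0 corresponds to `near_one` approaching 1; the ratios of 0.5 and 3.0 are the kind of lumps below and far above 1.0 described for an unbalanced network.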

Data Visualization Issues

For all of the above metrics, we started with the assumption that the full CDF is what we want, to capture the best and worst cases as well as the distribution of the values. However, one major downside of CDFs is that they are difficult to use to represent changes over time. Each CDF graph is a snapshot of performance over some time window.

This leads to the following visualization questions:

  1. Should we generate one CDF per consensus interval during tuning and evaluation?
  2. Should we always generate one CDF per day, so we have a historical archive?
  3. Can we visualize the CDF in another way, eg by quantile plots (ie: different textures/colors for every 5% quantile)?

Having some way to look at these metrics over time, with their full distribution, will vastly improve our ability to understand performance cycles in the network, as well as its reaction to events such as massive user arrival.
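One way the quantile-plot idea from question 3 could be prototyped is to reduce each day's samples to the 5%, 10%, ..., 95% cut points and plot those as bands over time. The sketch below uses `statistics.quantiles` from the Python standard library; the per-day grouping and the sample data are hypothetical, and the "inclusive" interpolation method is an arbitrary choice.

```python
from statistics import quantiles


def quantile_bands(samples_by_day):
    """Map each day to its 19 cut points at 5% steps (5% through 95%)."""
    return {
        day: quantiles(samples, n=20, method="inclusive")
        for day, samples in sorted(samples_by_day.items())
    }


# Hypothetical download times in seconds, grouped by measurement day.
samples_by_day = {
    "2020-03-01": list(range(0, 21)),
    "2020-03-02": [x * 1.5 for x in range(0, 21)],
}
bands = quantile_bands(samples_by_day)
```

Each day then becomes one column of 19 values rather than a separate CDF graph, so a widening gap between the lower and upper bands over successive days would show variance growing over time.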

Sources of Model Error

In addition to visualization problems, our metrics currently suffer from the following major sources of model error, causing us to fail to accurately represent actual user experience:

  1. Torperf does not use Guard nodes
    • No Guard nodes means that we have no visibility into the effect of guard selection on user experience
    • Some users pick slow guards and have a very slow Tor; some users pick fast guards and have a very fast Tor. We have no idea what this distribution of fast vs slow Tor experience even looks like.
    • Several of our performance tuning experiments require that Guard nodes be involved, to accurately measure effects of Guard selection
  2. Torperf does not have a user activity model
    • Tuning predictive circuit building is not possible without a user activity model
    • 5MB download times do not closely represent web page rendering times, even if all the components of a page sum to 5MB
  3. Torperf does not have an accurate browser model
    • Browser-specific performance improvements (and regressions) due to Optimistic Data, HTTP/2, HTTP Prefetch, and other browser properties cannot be measured by Torperf

Metrics that require new collection methodology

  1. CDF-RTT
    • Requires multiple HTTP Echo Servers in well-chosen geographical locations; OR some kind of hack, like connecting to IP:Port pairs forbidden by Exit policy and timing the rejection response (maybe this is better?)
    • Requires recording and visualizing circuit latency over time, in addition to differences in per-circuit latency.
  2. Guard-based Torperf runs require new torperf instances that use guards (but rotate them quicker than normal)
    • Requires patching Tor to allow torperf to set a short guard rotation interval
    • In most cases, can still just graph overall CDF of all fetches
    • In some cases, might require averaging all runs from a single guard selection as a single datapoint?
  3. Any browser-based metrics (eg: accurate Alexa Top 50 page render times over Tor)
    • Can we get any of these metrics from Mozilla?
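For the guard-based runs in item 2, collapsing all runs from a single guard selection into one datapoint could look like the sketch below. The guard labels, the timing values, and the choice of the mean as the summary are all hypothetical; whether averaging is the right reduction is the open question raised above.

```python
from collections import defaultdict
from statistics import mean


def per_guard_means(runs):
    """Collapse (guard, seconds) runs into one mean value per guard."""
    by_guard = defaultdict(list)
    for guard, seconds in runs:
        by_guard[guard].append(seconds)
    return {guard: mean(values) for guard, values in by_guard.items()}


# Hypothetical time-to-last-byte results, tagged with the guard in use.
runs = [("guardA", 2.0), ("guardA", 4.0), ("guardB", 10.0)]
datapoints = per_guard_means(runs)
```

A CDF built from `datapoints.values()` would then describe the spread of experience across guard selections rather than across individual fetches.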
Last modified on Mar 9, 2020, 10:58:20 AM