wiki:org/roadmaps/CoreTor/PerformanceExperiments

Experimental Plan

We're going to perform a series of experiments using consensus parameter changes, both to observe their effects on the performance characteristics of the live network, and to try to reproduce these effects on a testing network or simulator for use in future experiments.

For each of these experiments, we will enumerate the parameter changes, the metrics required to measure their effects, the expected results, and any anonymity risks.

Each experiment will be performed over a period of one week per parameter value, alternating the parameter between the experimental value and the default: every 24-hour period for metrics that rely only on output from torperf, and every 48-hour period for metrics that require data from extrainfo descriptors (currently only one experiment requires this). A sketch of this schedule is below.
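For illustration, a minimal sketch of the alternation schedule (the start date and parameter values are hypothetical):

{{{#!python
# Minimal sketch of the on/off schedule described above.
# period_days=1 gives the 24-hour cadence for torperf-only metrics;
# period_days=2 gives the 48-hour cadence for extrainfo-based metrics.
from datetime import date, timedelta

def value_for_day(day_index, experimental, default, period_days=1):
    phase = (day_index // period_days) % 2
    return experimental if phase == 0 else default

start = date(2019, 6, 1)  # hypothetical start date
for offset in range(7):   # one week per parameter value
    day = start + timedelta(days=offset)
    print(day, 'cbtquantile =', value_for_day(offset, 60, 80))
}}}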

Metrics Definitions

The following metrics are a minimal set of things we can currently extract from torperf/onionperf, to get an approximation of user-perceived performance characteristics of the live Tor network. While the following metrics use the same data as is used to provide the graphs on metrics.torproject.org, they capture more detail.

In particular, it is important for us to know the full distribution of performance, because performance variance is one of the main things that influences a user's perceived experience. Most of the below have been generated by Karsten for our previous meetings:

  • Failure rainbow: The rate of stream timeouts and other connection failures similar to page 1 of this report. XXX: Circuit timeouts and circuit failures should appear here somehow. Karsten also mentioned new failure types.
    • A good failure rainbow (ie: one that indicates healthy network performance) has a low number of stream timeouts, no user-facing failures, and no failures during download. It should look like a single color, or be largely dominated by a single color, and not like an actual rainbow.
    • A bad failure rainbow looks more like a smeared out actual rainbow. It has lots of failure counts for lots of different colors. The onion service rainbow from that report indicates that onion services are less healthy performance-wise than the public server. To emphasize that Failure Rainbows are bad, only vomit-related color tones should be used.
  • CDF-TTFB: Cumulative distribution function of the time-to-first-byte of a 5MB download. See page 2 of this report.
    • A good CDF-TTFB should look like a cliff (very little performance variance in times) and this cliff should be close to the origin of the graph (very fast response times overall).
    • A bad CDF-TTFB will look like a long, slow climb (high variance in performance and lots of slow results), and be very far from the origin of the graph (slow overall/average case performance).
  • CDF-TTLB: Cumulative distribution function of the time-to-last-byte of a 5MB download. See page 3 of this report.
    • Good and bad results for this CDF have the same characteristics as the CDF-TTFB graph, but this graph shows us the performance of the entire download overall.
  • CDF-DL: This is the CDF of the average bandwidth of the second half of a 5MB download, similar to page 4 of this report.
    • Good and bad results for this CDF have the same characteristics as the CDF-TTFB and CDF-TTLB graphs, but this graph shows us the distribution of the steady-state throughput of the network for very long downloads. (A sketch of computing these CDFs from torperf data follows this list.)
  • Circuit timeout rate: The frequency of circuit build timeouts observed through the BUILDTIMEOUT_SET control port event, or by manual counting.
    • The circuit timeout rate should consistently correspond to the cbtquantile consensus parameter (ie: a timeout rate of 1.0 minus the quantile). (XXX: This could be combined with the Failure Rainbow metric).
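As a concrete example, here is a minimal sketch of computing the CDF-TTFB and CDF-TTLB metrics from torperf results. It assumes the standard torperf .tpf fields START, DATARESPONSE, DATACOMPLETE, and DIDTIMEOUT (whitespace-separated KEY=VALUE rows); if onionperf's output differs, the field names will need adjusting, and the input file name is hypothetical.

{{{#!python
# Minimal sketch: empirical CDF-TTFB / CDF-TTLB from torperf results.

def parse_tpf(path):
    rows = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith(('#', '@')):
                continue
            rows.append(dict(kv.split('=', 1) for kv in line.split() if '=' in kv))
    return rows

def empirical_cdf(values):
    """Return the (x, y) points of the empirical CDF of values."""
    xs = sorted(values)
    return [(x, (i + 1) / len(xs)) for i, x in enumerate(xs)]

rows = [r for r in parse_tpf('torperf-5mb.tpf') if r.get('DIDTIMEOUT', '0') == '0']
ttfb = [float(r['DATARESPONSE']) - float(r['START']) for r in rows]
ttlb = [float(r['DATACOMPLETE']) - float(r['START']) for r in rows]
for x, y in empirical_cdf(ttfb):
    print('%.3f\t%.3f' % (x, y))  # a "good" CDF climbs to 1.0 quickly, near x=0
}}}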

The following metrics come from relay extrainfo descriptors. Because relays choose different time intervals for the values in these metrics, we must use much larger on/off time windows for experiments that need these metrics (irl suggests 72 hour cycles, using only the middle 24 hours for results):

  • Per-Flag Spare Network Capacity: These metrics measure the difference between peak observed bandwidth (aka peak "advertised bandwidth" over time) and current average bandwidth read/write history values. (We don't provide this directly but it can be computed via our public graphs). Technically we have one of these metrics for each node type (Exit, Guard+Exit, Guard, Middle). Note that raw "advertised bandwidth" is not an accurate reflection of peak capacity of a node -- we want to extract the highest advertised bandwidth value over longer periods of time (eg 1 month) for each node to get a better reflection of peak capacity for use in deriving this metric.
    • A healthy network has a large difference between its peak possible throughput and the average load it sustains. This means it has plenty of room for new flows to be added without congestion or contention. (This metric has been improving at a very high rate ever since Snowden, so yay.)
    • An unhealthy network operates with an average capacity that is very close to its peak possible throughput. This means most of its streams are in a congested state -- latency will build up and other performance/health metrics should show signs of stress.
  • Per-Relay Spare Network Capacity CDF: Similar to the Per-flag Spare Network Capacity, it is also possible to derive a CDF of the distribution of the difference between peak advertised bandwidth and average read/write history for all relays in the network (as a percentage of the advertised bandwidth for that relay). This metric would show us what the distribution of spare capacity is across the network.
    • A healthy network will be well load-balanced, with all relays tending to operate with similar amounts of reserve capacity in proportion to their total. Thus, this CDF should be narrow and cliff-like, and the cliff should be centered at the same location as the overall spare network capacity relative to its total (each relay is loaded the same as the overall network). (A sketch of deriving this CDF from descriptor archives follows.)
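A minimal sketch of deriving the Per-Relay Spare Network Capacity CDF from CollecTor archives via stem follows. Taking the maximum observed bandwidth over the period as "peak capacity" is an assumption consistent with the note above, and the archive file names are hypothetical.

{{{#!python
# Minimal sketch: Per-Relay Spare Network Capacity CDF via stem.
# "Peak" is the max observed bandwidth over the period from server
# descriptors; "load" is the mean rate from the extra-info write history.
from collections import defaultdict
from stem.descriptor import parse_file

peak = defaultdict(int)   # fingerprint -> peak observed bandwidth (bytes/s)
load = defaultdict(list)  # fingerprint -> average write rates (bytes/s)

for desc in parse_file('server-descriptors-2019-05.tar',
                       descriptor_type='server-descriptor 1.0'):
    peak[desc.fingerprint] = max(peak[desc.fingerprint],
                                 desc.observed_bandwidth or 0)

for desc in parse_file('extra-infos-2019-05.tar',
                       descriptor_type='extra-info 1.0'):
    if desc.write_history_values and desc.write_history_interval:
        rate = sum(desc.write_history_values) / (
            len(desc.write_history_values) * desc.write_history_interval)
        load[desc.fingerprint].append(rate)

# Spare capacity as a fraction of each relay's peak, sorted into a CDF.
spare = sorted(max(0.0, 1.0 - (sum(load[fp]) / len(load[fp])) / cap)
               for fp, cap in peak.items() if cap and load[fp])
for i, s in enumerate(spare):
    print('%.3f\t%.3f' % (s, (i + 1) / len(spare)))  # CDF points
}}}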

Experiments

Circuit Build Timeout

  • Parameter values to test:
    • consensus: cbtquantile=40, cbtquantile=60, cbtquantile=80 (current), cbtquantile=90
  • Metrics:
    • CDF-TTFB
    • Failure rainbow
    • Circuit timeout rates
  • Expected results:
    • Reducing cbtquantile improves CDF-TTFB considerably (CDF-TTLB and CDF-DL would show less improvement, since congestion changes over time)
    • As we set cbtquantile lower, the CDF-TTFB graph should become a sharper cliff and move to the left. This is because it will both reduce latency (the left shift) and reduce performance variance (making the CDF more cliff-like).
    • Because this is a congestion-avoidance mechanism, we should see *increasing* returns for each percentile of decrease in the cbtquantile parameter (this is because all clients will be avoiding the more congested+slow circuits, which means fewer congested and slow circuits on the network overall, which means lower overall latency).
    • Tor clients give up on the selected percentage of circuits (not more, not less)
    • Circuit failure rates likely go down for timeout-related failures
  • Potential Sources of Model Error:
    • The circuit build timeout code was designed when we used three guards. It may no longer actually enforce that a proper cbtquantile of circuits time out with 1 or 2 guards. This may affect performance positively or negatively, as well as have anonymity impact.
    • Since torperf does not use guards, it may exhibit different results without them than with them; we may want to perform this experiment in tandem with the Number of Guards experiment (or at least run an additional torperf instance with a short GuardLifetime value, as suggested in the Number of Guards experiment).
  • Anonymity effects:
    • Path reduction (clients will only use the fastest 'cbtquantile' percent of paths, which means fewer network paths are used)
    • In extreme cases of very low cbtquantile, clients will tend to prefer network paths that contain only routers that are geographically close to them, which may leak information about their geographical location.
  • Instrumentation Needed To Verify Operation:
    • On the torperf clients, the BUILDTIMEOUT_SET control port event should have a CUTOFF_QUANTILE field value that matches the cbtquantile consensus parameter. Additionally, the rate of circuit timeouts on torperf should match 1.0-CUTOFF_QUANTILE, as well as match the TIMEOUT_RATE field of BUILDTIMEOUT_SET (see the monitoring sketch after this experiment).
  • Abort Criteria:
    • If the TIMEOUT_RATE field or the manually counted circuit timeout rate exceeds 1.0-CUTOFF_QUANTILE by more than 0.1, the experiment should be stopped and we should investigate and debug the circuit build timeout code.
    • The failure rate of torperf and onionperf should be closely monitored, to ensure that onion services do not have unexpected amounts of additional failure during this experiment. If failure rates increase, we should abort.
    • If Torperf uses exit nodes and rendezvous points out of proportion to their consensus weights after this change, we should abort. The vanguards rendguard component has code to monitor rend point use already; it can be adapted for exits as well.
  • User Impact/What to Tell Users:
    • So long as the abort criteria are not met, the user impact should be minimal for small changes to this parameter: latency should just improve. For very low values, the geolocation concern increases, but we should be able to rule those out through the abort criteria.
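A minimal monitoring sketch for the instrumentation and abort criteria above, using stem. EXPECTED_QUANTILE must be updated per experiment phase to track the cbtquantile value being served (cbtquantile/100), and the event attribute names quantile and timeout_rate are assumed to be stem's names for CUTOFF_QUANTILE and TIMEOUT_RATE (worth verifying against the stem version in use).

{{{#!python
# Minimal monitoring sketch for the CBT experiment, using stem.
from stem.control import Controller, EventType

EXPECTED_QUANTILE = 0.80  # cbtquantile/100, set per experiment phase
ABORT_SLACK = 0.1         # abort threshold from the criteria above

def on_buildtimeout_set(event):
    # Fields are optional for some BUILDTIMEOUT_SET types, so skip
    # events that lack them.
    if event.quantile is None or event.timeout_rate is None:
        return
    if abs(event.quantile - EXPECTED_QUANTILE) > 0.01:
        print('WARN: consensus quantile not applied:', event.quantile)
    if event.timeout_rate > (1.0 - EXPECTED_QUANTILE) + ABORT_SLACK:
        print('ABORT: timeout rate too high:', event.timeout_rate)

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.add_event_listener(on_buildtimeout_set, EventType.BUILDTIMEOUT_SET)
    input('monitoring BUILDTIMEOUT_SET; press enter to stop\n')
}}}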

Fast Relay Cutoff

  • Parameters:
    • consensus: FastFlagMinThreshold=$(bandwidth_of_n_percent_fastest_nodes)
    • torrc: AuthDirFastGuarantee=0
  • Metrics:
    • Failure rainbow
    • CDF-TTLB
    • CDF-DL
    • Per-Relay Spare Network Capacity CDF
  • Expected results:
    • Slow relays in the network are overloaded more than faster relays. Cutting them out should reduce the overall rates of timeout-related failures.
    • It should similarly reduce the variance of the performance of the network, to the extent that the slow relays would have been chosen. This should mean that the CDF-TTLB and CDF-DL graphs become more cliff-like (but should not shift left overall, like we expected for CBT).
    • The Per Relay Spare Network Capacity CDF should narrow and become more cliff-like, since slow relays are more overloaded than the rest of the network.
  • Potential Sources of Model Error:
    • We don't know which slow relays are slow, or why. Depending on the threshold, we may cut out unused relays, overloaded relays, or relays suffering from other bugs (see the KIST experiment).
  • Anonymity effects:
    • Fewer relays means less diversity and fewer possible network paths, in proportion to where we set the cutoff.
  • Instrumentation Needed To Verify Operation:
    • Relays with a measured bandwidth below the cutoff should no longer appear in the consensus (see the consensus check sketched after this experiment)
  • Abort Criteria:
    • If relays other than the expected cutoff set disappear from the consensus, abort.
  • User Impact/What to Tell Users:
    • Relay operators on tor-relays should be made aware of these plans; possibly also by mailing the operators of affected relays directly, where contact info is available.
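A minimal sketch of the consensus check described above, using stem. The cutoff value and the pre-change snapshot file ("fingerprint bandwidth" per line) are hypothetical.

{{{#!python
# Minimal sketch: verify that only relays below the cutoff left the
# consensus, by comparing against a snapshot taken before the change.
import stem.descriptor.remote

CUTOFF_BW = 2000  # hypothetical cutoff, in consensus weight units

before = {}
with open('consensus-before.txt') as fh:  # snapshot from before the change
    for line in fh:
        fp, bw = line.split()
        before[fp] = int(bw)

current = set(e.fingerprint for e in stem.descriptor.remote.get_consensus())

for fp, bw in sorted(before.items()):
    if fp not in current and bw >= CUTOFF_BW:
        print('ABORT candidate: relay above the cutoff is missing:', fp, bw)
}}}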

KIST

  • Parameters:
    • consensus: KISTSchedRunInterval at 2ms, 5ms, 10ms (default)
  • Metrics:
    • CDF-TTFB
    • CDF-DL
  • Expected Results:
    • The KIST scheduler interval has an effect on how often we are able to read and write data to the network. For relays with lots of TCP connections, a larger interval is better. For relays with only very few, a smaller interval is better. See #29427.
    • Depending on the number of connections that typical relays have, different values of this parameter may increase performance variance of steady state downloads (CDF-DL), as well as have an impact on latency (CDF-TTFB).
  • Potential Sources of Model Error:
    • The KIST scheduler consensus value may also apply to the Torperf client itself, which will deeply impact the results we see.
  • Anonymity Effects:
    • This experiment may help us get better performance out of slow relays, which will improve anonymity.
  • Instrumentation Needed To Verify Operation:
    • XXX: dgoulet/pastly? At minimum, we can confirm that the consensus parameter is set as intended during each phase (see the sketch after this experiment).
  • Abort Criteria:
    • XXX: dgoulet/pastly?
  • User Impact/What to Tell Users:
    • XXX: Did we even tell our users when we deployed KIST or EWMA apart from the changelog lines?
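Whatever relay-side instrumentation dgoulet/pastly suggest, a minimal sanity check is to confirm the parameter value actually being served in the consensus, e.g. via stem:

{{{#!python
# Minimal sketch: confirm the KISTSchedRunInterval value currently in
# the consensus during each phase of the experiment.
import stem.descriptor.remote
from stem.descriptor import DocumentHandler

query = stem.descriptor.remote.get_consensus(document_handler=DocumentHandler.DOCUMENT)
for consensus in query:
    value = consensus.params.get('KISTSchedRunInterval')
    print('KISTSchedRunInterval =',
          value if value is not None else 'unset (10ms default)')
}}}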

Number of Guards

  • Parameters:
    • consensus: guard-n-primary-guards-to-use=2 (1 is default)
    • consensus: guard-n-primary-guards=2 (1 is default)
    • torperf torrc: UseEntryGuards 1
    • torperf torrc: GuardLifetime 1 day (or less; requires tor patch)
  • Metrics:
    • CDF-TTFB
    • CDF-TTLB
    • CDF-DL
    • Failure rainbow
    • Circuit timeouts (maybe)
  • Expected Results:
    • With only one guard, Torperf's variance for all performance characteristics should be much larger. Additionally, with more than one guard, Circuit Build Timeout should be able to avoid one of the guards if either becomes temporarily overloaded. As a result, we should also see an increase in average performance. So switching to two guards should make all CDFs more cliff-like and move them all to the left (towards the origin/faster performance).
  • Potential Sources of Model Error:
    • Torperf doesn't use guards by default; making it do so in a way that allows us to get an idea of the performance variance of different combinations of guards will require a much longer run of this experiment than of the others.
  • Anonymity Effects:
    • See Proposal 291
  • Instrumentation Needed To Verify Operation:
    • Control port monitoring of the GUARD and ORCONN events to ensure that our torperf instances use the expected number of guards (see the monitoring sketch after this experiment)
  • Abort Criteria:
    • If more guards than we expect are used, abort.
  • User Impact/What to Tell Users:
    • Probably requires a blog post; guard issues are a deep rabbithole.
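A minimal sketch of that control port monitoring, using stem. The GuardEvent attribute names and status values follow stem's documentation and the control spec (an assumption worth verifying against the versions in use).

{{{#!python
# Minimal sketch: track the number of distinct guards a torperf client
# is using via GUARD events, and flag when more than expected are used.
from stem.control import Controller, EventType

EXPECTED_GUARDS = 2  # per the consensus parameters above
in_use = set()

def on_guard(event):
    if event.status in ('NEW', 'UP', 'GOOD'):
        in_use.add(event.endpoint_fingerprint)
    elif event.status == 'DROPPED':
        in_use.discard(event.endpoint_fingerprint)
    if len(in_use) > EXPECTED_GUARDS:
        print('ABORT: more guards in use than expected:', sorted(in_use))

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.add_event_listener(on_guard, EventType.GUARD)
    input('monitoring GUARD events; press enter to stop\n')
}}}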

Preemptive Circuit Building

  • Parameters:
  • Metrics:
  • Expected Results:
  • Potential Sources of Model Error:
    • If torperf does not use the same number of circuits as we expect most clients to use, and in the same patterns, this will bias our results.
  • Anonymity Effects:
  • Instrumentation Needed To Verify Operation:
  • Abort Criteria:
  • User Impact/What to Tell Users:

EWMA

  • Parameters:
  • Metrics:
  • Expected Results:
  • Potential Sources of Model Error:
  • Anonymity Effects:
  • Instrumentation Needed To Verify Operation:
  • Abort Criteria:
  • User Impact/What to Tell Users:

Experiment Template

  • Parameters:
  • Metrics:
  • Expected Results:
  • Potential Sources of Model Error:
  • Instrumentation Needed To Verify Operation:
  • Anonymity Effects:
  • Abort Criteria:
  • User Impact/What to Tell Users: