Opened 17 months ago

Last modified 16 months ago

#24018 new project

Automate measuring connection timeouts per exit

Reported by: arthuredelstein
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Statistics
Version:
Severity: Normal
Keywords: tbb-performance, tbb-needs
Cc: arthuredelstein, teor, robgjansen, hiro, karsten
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

I have been investigating connection timeouts manually, using Tor Browser in #21394.

My manual test is as follows: I set Tor Browser's pref "extension.torbutton.loglevel" to 3. In the Browser console, I filter for the word "TIMEOUT". Then I attempt to connect to a website and count the number of TIMEOUTs displayed in the console, such as this:

[10-26 06:25:47] Torbutton INFO: controlPort >> 650 STREAM 532 DETACHED 833 2606:2800:220:1:248:1893:25c8:1946:80 REASON=TIMEOUT
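
The manual counting step could be scripted roughly as follows. This is only a sketch: it assumes log lines shaped like the one above, and the function names (parse_stream_event, count_timeouts_per_circuit) are made up for illustration, not part of Torbutton or tor.

```python
import re
from collections import Counter

# Matches the STREAM event inside a Torbutton log line such as:
#   ... controlPort >> 650 STREAM 532 DETACHED 833 host:80 REASON=TIMEOUT
# Field order assumed: stream ID, status, circuit ID, target, key=value pairs.
STREAM_EVENT = re.compile(
    r"650 STREAM (?P<stream>\d+) (?P<status>\S+) (?P<circuit>\d+) "
    r"(?P<target>\S+)(?P<rest>.*)"
)

def parse_stream_event(line):
    """Return a dict describing the STREAM event in `line`, or None."""
    match = STREAM_EVENT.search(line)
    if match is None:
        return None
    # Trailing fields look like KEY=VALUE (for example REASON=TIMEOUT).
    fields = dict(
        item.split("=", 1) for item in match.group("rest").split() if "=" in item
    )
    return {
        "stream": int(match.group("stream")),
        "status": match.group("status"),
        "circuit": int(match.group("circuit")),
        "target": match.group("target"),
        "reason": fields.get("REASON"),
    }

def count_timeouts_per_circuit(lines):
    """Count REASON=TIMEOUT stream events, keyed by circuit ID."""
    counts = Counter()
    for line in lines:
        event = parse_stream_event(line)
        if event is not None and event["reason"] == "TIMEOUT":
            counts[event["circuit"]] += 1
    return counts
```

Feeding the saved console output line by line into count_timeouts_per_circuit would give one timeout count per circuit, which is what I have been writing down by hand.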

I repeatedly hit "New Tor Circuit for this Site" in the torbutton menu and manually write down how many timeouts were observed for each circuit. Here's my data from when I attempted to connect to example.com 50 times:

http://example.com
00100000021000002001001001000000001000000000001000
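
A tally string like the one above can be summarized with a few lines of Python. This is a minimal sketch that assumes each character is the (single-digit) number of timeouts seen on one circuit; summarize_tally is a hypothetical helper name.

```python
def summarize_tally(tally):
    """Summarize a string where each digit is one circuit's timeout count."""
    counts = [int(ch) for ch in tally.strip()]
    circuits = len(counts)
    affected = sum(1 for c in counts if c > 0)
    return {
        "circuits": circuits,
        "total_timeouts": sum(counts),
        "circuits_with_timeout": affected,
        # Fraction of circuits that saw at least one timeout.
        "fraction_affected": affected / circuits if circuits else 0.0,
    }

# The example.com tally from this ticket.
EXAMPLE_COM = "00100000021000002001001001000000001000000000001000"

if __name__ == "__main__":
    print(summarize_tally(EXAMPLE_COM))
```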

According to arma, this sort of stream timeout happens because:

it means you sent your begin cell, and then you didn't get an end cell or a connected cell after 10 seconds

The dominant source of timeouts appears to be DNS resolution failures at the exit nodes. I observed almost no timeouts connecting directly to IPv4 or IPv6 addresses instead of a domain name (see ticket:21394#comment:20).

Regardless of the cause, I think these timeouts are causing serious damage to Tor Browser usability and we should try hard to fix them.

teor suggested some fixes to tor. In the meantime it would be great if we had an automated test that can measure the frequency of connection timeouts on a daily basis. I imagine it could generate several circuits through each exit node (both to domains and to bare IP addresses) and produce summary statistics. That would also help us know if the fixes are working or if we have any regressions in the future.
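
The summary-statistics part of such a test might look something like the sketch below. It does not build circuits or talk to tor; it only assumes that some measurement step has already produced (exit, target kind, timed out) records. The record layout, the "domain"/"ip" labels and the function names are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def timeout_rates(attempts):
    """Compute the timeout rate per (exit, target kind).

    `attempts` is an iterable of (exit_id, target_kind, timed_out) tuples,
    where target_kind is e.g. "domain" or "ip" and timed_out is a bool.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [timeouts, attempts]
    for exit_id, target_kind, timed_out in attempts:
        entry = totals[(exit_id, target_kind)]
        if timed_out:
            entry[0] += 1
        entry[1] += 1
    return {key: timeouts / tries for key, (timeouts, tries) in totals.items()}

def mean_rate_by_kind(rates):
    """Average the per-exit rates separately for each target kind."""
    by_kind = defaultdict(list)
    for (_exit_id, target_kind), rate in rates.items():
        by_kind[target_kind].append(rate)
    return {kind: mean(values) for kind, values in by_kind.items()}
```

Comparing the "domain" and "ip" averages from day to day would show whether DNS-related timeouts are going down after a fix, and the per-exit rates would point at individual exits that time out unusually often.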

Is this something the Metrics team would be interested in working on? I see the timeout statistics on https://metrics.torproject.org/torperf-failures.html but I don't think that is measuring exactly the same thing.

Change History (12)

comment:1 Changed 17 months ago by arthuredelstein

Cc: arthuredelstein teor added
Keywords: tbb-performance tbb-needs added

comment:2 Changed 17 months ago by iwakeh

Component: Metrics → Metrics/Ideas
Type: defect → project

Sorting this into sub-component 'Metrics/Ideas'.

comment:3 Changed 17 months ago by teor

Does OnionPerf use DNS or IP addresses?
Because it would be the logical place to add these checks.

comment:4 in reply to:  3 Changed 17 months ago by karsten

Cc: robgjansen hiro karsten added
Component: Metrics/Ideas → Core Tor/Tor

I agree that OnionPerf is probably the best tool for this analysis. I don't know for sure whether it uses DNS or IP addresses, and I'd rather not guess here. Instead I'm cc'ing robgjansen, who can give more insight into what OnionPerf does by default, and hiro, who can tell whether our three OnionPerf instances deviate from those defaults or not.

Regarding the question above whether this is something the Metrics team would be interested in working on, that's a tough question! It seems like the goal here is to fix a specific bug in the Tor daemon and use measurement data from a few weeks or even months to make sure the bug got fixed. And when it's fixed, there's no use for this specific measurement anymore.

My gut feeling is that it's something that the Network team should look into while fixing this bug, with help of the Browser and Metrics teams as necessary. I'm moving this ticket to Core Tor/Tor which I believe is a better fit than Metrics/*. Please move it back if you think I'm wrong. If the task was to set up a new measurement and add it to the Metrics website, the answer would likely be different. Thanks/sorry!

comment:5 Changed 17 months ago by teor

Component: Core Tor/Tor → Metrics/Statistics

There are two goals that we need to achieve:

  • fix a specific bug in tor, and make sure it is fixed (#24014)
  • continually measure client experience of tor (including DNS lookups), and display it on metrics (this ticket)

comment:6 Changed 17 months ago by arthuredelstein

I agree with teor that we will need to continually make this measurement in the future, in part to detect regressions and also to try to keep improving performance. That said, I appreciate that karsten and the Metrics Team need to be selective about what they work on, so maybe let's wait and see how #21394 proceeds a little first.

Meanwhile I have been using a script to survey the DNS-related stream timeouts (see ticket:21394#comment:27).

comment:7 Changed 17 months ago by Sebastian

I feel like when the current situation is resolved, failing DNS resolution absurdly often should earn you a badexit flag. This can easily happen with misconfigured relays.

comment:8 in reply to:  7 Changed 17 months ago by arma

Replying to Sebastian:

I feel like when the current situation is resolved, failing DNS resolution absurdly often should earn you a badexit flag.

Agreed.

comment:9 Changed 17 months ago by robgjansen

TL;DR: OnionPerf does not currently do a great job of capturing DNS issues, but it could be made to do so.

OnionPerf works by setting up a traffic generator (TGen) client and a server on the machine on which OnionPerf is run. OnionPerf makes the server accessible via an onion service (no DNS involved) as well as via the regular Internet. The client guesses an IP address of the machine on which it is running, and makes requests through Tor to that IP address.

The reason for the above behavior is that it does not depend on the user running OnionPerf actually having a domain name registered for the host running OnionPerf. I thought that was a usability win, but I can see that this means we are missing some important aspect of Tor Browser performance.

In principle, OnionPerf should be able to pass either an IP address or a domain name to TGen. If it passes a domain name, TGen will currently do the lookup itself rather than asking the SOCKS proxy (Tor) to do it. TGen could instead ask the SOCKS proxy (Tor) to do the lookup, and in fact the code to do so already exists in order to handle .onion fetches.
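
To illustrate the difference at the SOCKS level, here is a rough sketch of how a SOCKS5 CONNECT request (RFC 1928) differs between the two cases. This is not TGen's code; socks5_connect_request is a hypothetical helper, and it only builds the request bytes without opening a connection.

```python
import ipaddress
import struct

def socks5_connect_request(host, port):
    """Build the bytes of a SOCKS5 CONNECT request for host:port.

    A literal IP is sent with address type 0x01 (IPv4) or 0x04 (IPv6).
    Anything else is sent as a domain name (address type 0x03), which
    leaves the name resolution to the proxy instead of the client.
    """
    header = b"\x05\x01\x00"  # version 5, command CONNECT, reserved byte
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        name = host.encode("ascii")
        if not 1 <= len(name) <= 255:
            raise ValueError("domain name must be 1..255 bytes")
        address = b"\x03" + bytes([len(name)]) + name
    else:
        address = (b"\x01" if ip.version == 4 else b"\x04") + ip.packed
    return header + address + struct.pack("!H", port)
```

When the proxy is Tor and the request carries a domain name, the lookup happens at the exit, which is the case that matters for the DNS timeouts discussed in this ticket.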

Assuming that DNS errors would then be reported by Tor through the control port, it would be simple to have OnionPerf parse those errors and include them in the output files described here, or else in the raw JSON files that OnionPerf produces but Tor Metrics does not yet publish.

comment:10 Changed 16 months ago by hiro

While some more specific tests are being implemented in OnionPerf, I am setting up Prometheus + Grafana to monitor our OnionPerf VMs. The idea is that by having an easier way to look at the generated logs we can gain more insight into OnionPerf and the measurements it produces.

I still have to finish configuring the tools, and then I will share all the relevant information needed to log in and use them.

comment:11 Changed 16 months ago by teor

Parent ID: #21394

Parent ticket is done

comment:12 Changed 16 months ago by arthuredelstein

FWIW I have been running a daily scan of all exits here:
https://arthuredelstein.net/exits/

For each exit, I make 10 connection attempts to http://example.com and record the percentage of timeouts. Here's the source: https://github.com/arthuredelstein/tor_dns_survey

I agree it would be good to have something like this as part of OnionPerf. We want to get and keep the timeout rate near zero.
