Opened 23 months ago

Last modified 4 weeks ago

#21394 merge_ready defect

connection timeouts are affecting Tor Browser usability

Reported by: arthuredelstein Owned by:
Priority: Very High Milestone: Tor: 0.2.9.x-final
Component: Core Tor/Tor Version:
Severity: Normal Keywords: 029-backport, tbb-performance, tbb-usability, performance, tbb-needs, 033-triage-20180320, 033-included-20180320, 031-unreached-backport, 032-unreached-backport
Cc: gk, brade, mcs, arthuredelstein, tom, dhalgren.tor@… Actual Points:
Parent ID: Points:
Reviewer: mikeperry Sponsor:

Description (last modified by arthuredelstein)

I have spent some time watching circuit and stream events while connecting to different sites. I telnet into tor's control port using the following command (using ts to add time stamps):

telnet localhost 9151 | ts

I open the browser console and get the tor control password by entering
m_tb_control_pass
Then I authenticate by pasting the result like this:
authenticate [value of m_tb_control_pass]
Finally I subscribe to circuit and stream events by entering
setevents circ stream
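
The same event stream can also be watched with a short stem script instead of raw telnet. This is a sketch, not part of the original procedure: it assumes the control port is on 9151 and that stem can discover or be given the credentials (for Tor Browser's password auth, pass the m_tb_control_pass value to authenticate()):

from stem.control import Controller

# Minimal sketch: print CIRC and STREAM events with timestamps, like
# `telnet localhost 9151 | ts`. Port and auth are assumptions; adjust
# for your setup (e.g. authenticate(password=...) for Tor Browser).
with Controller.from_port(port=9151) as controller:
    controller.authenticate()
    controller.add_event_listener(lambda e: print(e.arrived_at, e),
                                  'CIRC', 'STREAM')
    input('Listening for events; press enter to stop.\n')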

What I have noticed is that a significant fraction of new site connections result in at least 1 timeout of 10 seconds. (Tor Browser's CircuitStreamTimeout is set to 0, which results in a timeout equal to MIN_CIRCUIT_STREAM_TIMEOUT, or 10 seconds.) Here's what a timeout looks like:

Feb 03 19:00:03 650 STREAM 868 NEW 0 people.torproject.org:443 SOURCE_ADDR=127.0.0.1:50318 PURPOSE=USER
Feb 03 19:00:03 650 CIRC 149 LAUNCHED BUILD_FLAGS=NEED_CAPACITY PURPOSE=GENERAL TIME_CREATED=2017-02-04T03:00:03.738597
Feb 03 19:00:03 650 CIRC 149 EXTENDED [...] BUILD_FLAGS=NEED_CAPACITY PURPOSE=GENERAL TIME_CREATED=2017-02-04T03:00:03.738597 SOCKS_USERNAME="torproject.org" SOCKS_PASSWORD="7d8ea4ccf4ba6345846e0fccacd4d941"
Feb 03 19:00:04 650 CIRC 149 EXTENDED [...] BUILD_FLAGS=NEED_CAPACITY PURPOSE=GENERAL TIME_CREATED=2017-02-04T03:00:03.738597 SOCKS_USERNAME="torproject.org" SOCKS_PASSWORD="7d8ea4ccf4ba6345846e0fccacd4d941"
Feb 03 19:00:04 650 CIRC 149 EXTENDED [...] BUILD_FLAGS=NEED_CAPACITY PURPOSE=GENERAL TIME_CREATED=2017-02-04T03:00:03.738597 SOCKS_USERNAME="torproject.org" SOCKS_PASSWORD="7d8ea4ccf4ba6345846e0fccacd4d941"
Feb 03 19:00:04 650 CIRC 149 BUILT [...] BUILD_FLAGS=NEED_CAPACITY PURPOSE=GENERAL TIME_CREATED=2017-02-04T03:00:03.738597 SOCKS_USERNAME="torproject.org" SOCKS_PASSWORD="7d8ea4ccf4ba6345846e0fccacd4d941"
Feb 03 19:00:04 650 STREAM 868 SENTCONNECT 149 people.torproject.org:443
[...]
Feb 03 19:00:14 650 STREAM 868 DETACHED 149 people.torproject.org:443 REASON=TIMEOUT
Feb 03 19:00:14 650 CIRC 150 LAUNCHED BUILD_FLAGS=NEED_CAPACITY PURPOSE=GENERAL TIME_CREATED=2017-02-04T03:00:14.588714

After a timeout occurs, the tor client closes the circuit, builds a new circuit and attempts to connect to the same site again. This repeats at least 3 times.

I did an experiment where I connected to https://people.torproject.org/~arthuredelstein (a page with hardly any content) and then repeatedly selected "New Tor Circuit for this Site" 50 times.

Here are the results for 50 reloads. Each digit represents the number of 10-second stream timeouts observed before a given connection succeeded.
20020000000000000000002010000000000001000100000103

In other words, 8 out of 50 connections showed a timeout. And interestingly, four of these connections exhibited a double or triple timeout (a 20- or 30-second delay).
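
These tallies can be checked mechanically from the digit string above (a trivial sketch):

# Tally the 50-reload result string quoted above; each digit is the
# number of 10-second stream timeouts before that connection succeeded.
results = "20020000000000000000002010000000000001000100000103"
timeouts = [int(d) for d in results]
print(len(timeouts), "connections")                            # 50
print(sum(1 for t in timeouts if t >= 1), "with a timeout")    # 8
print(sum(1 for t in timeouts if t >= 2), "with 2+ timeouts")  # 4
print(sum(timeouts) * 10, "seconds spent waiting in total")    # 130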

I think this may be a big part of the perception of Tor Browser as "slow". Actual loading of pages doesn't seem drastically slow to me, and once I have successfully connected to a new site, following links to other pages on the same site (i.e., the same circuit) is usually acceptable.

(I also did another quick test on another site and 5/25 connections had at least 1 timeout.)

So here are some questions for further investigation:

  • Why are there so many timeouts? Are any of these timeouts due to silent errors in a Tor node? (If such errors could be promptly reported back to the client, maybe we could avoid the waiting for the long timeout.)
  • What's the reason for MIN_CIRCUIT_STREAM_TIMEOUT being 10 seconds? Would it do any harm to make this shorter, say 5 seconds or 2 seconds?
  • So many double or triple timeouts are suspicious, because each timeout in a double or triple is reported for a different circuit. Could this mean the connection error is caused by the client or guard rather than a connection failure at the exit node?

Child Tickets

Ticket   Status  Owner    Summary                                                     Component
#24010   closed  aagbsn   Make bandwidth authorities use DNS, not IP addresses        Core Tor/Torflow
#24022   closed           Make clients avoid retrying slow exits when they time out   Core Tor/Tor

Attachments (1)

bug21394_touch_up.patch (2.3 KB) - added by Dhalgren 12 months ago.
recommended touch-up patch


Change History (99)

comment:1 Changed 23 months ago by arthuredelstein

Description: modified (diff)

comment:2 Changed 23 months ago by arthuredelstein

Description: modified (diff)

comment:3 Changed 23 months ago by arthuredelstein

Description: modified (diff)

comment:4 Changed 22 months ago by gk

Cc: gk added

comment:5 Changed 22 months ago by arthuredelstein

Description: modified (diff)

comment:6 Changed 22 months ago by mcs

Cc: brade mcs added

comment:7 in reply to:  description Changed 22 months ago by arma

Replying to arthuredelstein:

  • What's the reason for MIN_CIRCUIT_STREAM_TIMEOUT being 10 seconds? Would it do any harm to make this shorter, say 5 seconds or 2 seconds?

This one is straightforward: if it's 5 seconds or 2 seconds, then people on crummy (slow, lossy) internet connections will forever be giving up on the circuit before they even get the connected cell.

The underlying problem is that this is a static number for all users, not an adaptive number like the CircuitBuildTimeout.

I feel like I had a ticket long ago for making the stream timeout adaptive too, but maybe that never found its way into being an actual ticket.

(Even 10 seconds is too short for some people on crummy networks, which causes them to forever be abandoning circuits right before they work, and moving to new ones which they then abandon, and their Tor experience is no fun.)

comment:8 Changed 22 months ago by arthuredelstein

I had a conversation with arma on IRC and he made many good suggestions on how to go about investigating this further (reprinted with permission):

16:49 < arthuredelstein> In general, do connection timeout errors come from the exit node, or from the client?
16:50 < armadev> it means you sent your begin cell, and then you didn't get an end cell or a connected cell after 10 seconds
16:50 < armadev> it could be that you don't really have a tls connection to your guard at all, you just think you do
16:51 < armadev> it could be that the exit receives the begin cell and quietly drops it
16:51 < armadev> or maybe it gets the begin cell and starts its dns resolve and that takes a while
16:51 < armadev> one way to investigate further might be to see if you ever get a connected or end cell if you waited longer
16:52 < arthuredelstein> Ah, that's a good idea.
16:54 < arthuredelstein> Do you have an hypothesis why there are so many timeouts? Do you think exits are dropping cells?
16:54 < armadev> i am wondering if it has to do with the ipv6 thing
16:54 < armadev> we have a bunch of bugs in ipv6 handling
16:55 < arthuredelstein> that's interesting
16:56 < arthuredelstein> in other words, handling at the exit?
16:57 < armadev> yes
16:57 < armadev> is there some pattern with which exits are on problem circuits?
16:57 < armadev> you have the circuit events i hope so you can do the stats?
16:57 < armadev> it is also possible that some exits, or even really just a few but really big ones, and running out of file descriptors or something
16:58 < arthuredelstein> another good idea. I will look into that.
16:58 < armadev> s/and running/are running/
16:59 < armadev> people.tp.o has an ipv4 and ipv6 address. can you pick something simple and static that's only v4, and is that different?
17:01 < arthuredelstein> makes sense
17:02 < arthuredelstein> Something that made me wonder if it's something closer to the client or guard is that in my first batch of tests (to people.torproject.org) half of the attempted connections were double timeouts, meaning two circuits with different exits failed before a successful connection was made.
17:03 < arthuredelstein> it's -> the cause of the timeouts is
17:08 < armadev> another thing to explore is sending cells end-to-end on the circuit that we know should elicit an immediate response
17:08 < armadev> like a begin to 127.0.0.1
17:08 < armadev> which should immediately reply with 'end, exitpolicy'
17:08 < armadev> and bypass any attempts by the exit to do a dns resolve, open a socket, make a tcp connection, etc
17:16 < arthuredelstein> What's easiest way to send a begin cell?
17:17 < armadev> make a socks request?
17:17 < armadev> there might be something on the client side that tries to block a request to a destination it knows will fail
17:17 < armadev> and also tor browser does isolation by socks parameters so the new socks request will be isolated to a different circuit
17:18 < armadev> but i bet fixing those will still be more fun than my other answer, which is to check out how to call connection_ap_handshake_send_begin()
17:19 < arthuredelstein> Right. I think Tor Browser is blocking connections to 127.0.0.1.
17:19 < armadev> heck, the browser itself might be blocking those too
17:19 < arthuredelstein> or possibly not making a socks connection
17:19 < armadev> and the tor client will be blocking them even if the browser isn't
17:19 < armadev> i guess that's yet another experiment:
17:19 < armadev> do this same experiment with your tor client, no browser involved
17:20 < armadev> and no weird socks isolation
17:20 < arthuredelstein> Yes.
17:20 < armadev> and no weird preferipv6 socksport flag
17:21 < arthuredelstein> aha
17:24 < arthuredelstein> I guess I can also try connecting to port 80 of the exit's IP address as an alternative to 127.0.0.1.
17:25 < armadev> good idea
17:25 < armadev> (though then you have to guess the exit already)
17:25 < arthuredelstein> Yeah, I would need to turn off socks isolation.
17:25 < arthuredelstein> Or maybe do this outside the browser
17:26 < arthuredelstein> maybe I need to get acquainted with stem so I can automate these tests
17:27 < arthuredelstein> assuming the browser isn't causing the problem somehow
17:29 < armadev> having it automated would be extra cool because then it could be done again later without redoing all the work
17:33 < armadev> let me hunt down a ticket you'll find fun and related (though alas not the same)
17:35 < armadev> #5830
17:40 < arthuredelstein> And I see you also mention the possibility of instrumenting a browser.
19:49 < armadev> yet another thought: if this happens pretty consistently, can you collude with an exit relay to get debug-level logs at the time of the failure? to see what it sees and what it doesn't see? safelogging might make that harder.
19:49 < arthuredelstein> yeah, that would be great
19:50 < armadev> the precursor to that idea is: can you induce this behavior in a chutney network?
19:50 < armadev> i would assume no, because it requires real users, real load, real broken exits. but who knows!
19:50 < armadev> oh, and another: if you're curious if it's your guard, do the experiment again with a different guard!
19:51 < arthuredelstein> yeah, I should definitely do that!
19:52 < armadev> if your guard is overloaded, you could easily be seeing a delay there
19:52 < armadev> or the intermediate node too, for that matter
19:52 < arthuredelstein> right
19:52 < armadev> where you have to wait for somebody's freight train of packets to move before you can get your connected cell
19:54 < armadev> i guess category 1 of problem, you send your begin and it vanishes. you'll never get an answer.
19:55 < armadev> category 2, everything's working, it's just slow/congested, and you need more patience than the hard-coded 10s timeout.
19:55 < armadev> cranking up the timeout should help distinguish, for starters.
19:55 < arthuredelstein> yes
20:03 < arthuredelstein> Are there cases where a properly-behaving exit is expected to have category 1 behavior? Or should it always return an error message to the client if a tcp connection fails?
20:10 < armadev> every non-response is a bug
20:10 < armadev> are there bugs? there used to be! we don't know of any now.
20:11 < armadev> but of course, weird tcp stacks, and firewalls with rules that drop packets, can induce long timeouts

comment:9 Changed 22 months ago by dgoulet

Milestone: Tor: unspecified

comment:10 Changed 20 months ago by teor

This happened to me with Tor Browser 6.5.1 after my machine was asleep for 5 minutes. Then, after it went to sleep and woke again, everything worked fine.

comment:11 Changed 20 months ago by cypherpunks

16:51 < armadev> or maybe it gets the begin cell and starts its dns resolve and that takes a while

This.

Last edited 20 months ago by cypherpunks (previous) (diff)

comment:12 Changed 20 months ago by cypherpunks

Last edited 20 months ago by cypherpunks (previous) (diff)

comment:13 Changed 20 months ago by cypherpunks

Last edited 20 months ago by cypherpunks (previous) (diff)

comment:14 Changed 20 months ago by cypherpunks

Opened: 2017-02-04
Changes in version 0.3.0.2-alpha - 2017-01-23

comment:15 Changed 19 months ago by cypherpunks

Keywords: performance added
Milestone: Tor: unspecifiedTor: 0.3.1.x-final

Guys, the number of timeouts in recent versions of Tor is really unacceptable.
Worse, even circuits that initially succeeded show timeouts after a period of time (e.g. when browsing Trac tickets slowly).
So we need to do something about it now.

comment:16 Changed 17 months ago by nickm

Milestone: Tor: 0.3.1.x-finalTor: 0.3.2.x-final

comment:17 Changed 17 months ago by nickm

Priority: MediumVery High

comment:18 Changed 16 months ago by cypherpunks

[08-04 17:35:24] Torbutton INFO: controlPort >> 650 STREAM 6240 DETACHED 798 trac.torproject.org:443 REASON=TIMEOUT
[08-04 17:35:31] Torbutton INFO: controlPort >> 650 STREAM 6240 DETACHED 802 trac.torproject.org:443 REASON=END REMOTE_REASON=RESOLVEFAILED
[08-04 17:35:32] Torbutton INFO: controlPort >> 650 STREAM 6241 DETACHED 798 trac.torproject.org:443 REASON=TIMEOUT
[08-04 17:35:34] Torbutton INFO: controlPort >> 650 STREAM 6241 REMAP 804 [2a01:4f8:172:39ca:0:dad3:3:1]:443 SOURCE=EXIT

It looks like trac.tpo is always resolved to an IPv6 address, and Tor switches through the exit nodes until one with IPv6 support is found.

comment:19 in reply to:  18 Changed 15 months ago by teor

Replying to cypherpunks:

[08-04 17:35:24] Torbutton INFO: controlPort >> 650 STREAM 6240 DETACHED 798 trac.torproject.org:443 REASON=TIMEOUT
[08-04 17:35:31] Torbutton INFO: controlPort >> 650 STREAM 6240 DETACHED 802 trac.torproject.org:443 REASON=END REMOTE_REASON=RESOLVEFAILED
[08-04 17:35:32] Torbutton INFO: controlPort >> 650 STREAM 6241 DETACHED 798 trac.torproject.org:443 REASON=TIMEOUT
[08-04 17:35:34] Torbutton INFO: controlPort >> 650 STREAM 6241 REMAP 804 [2a01:4f8:172:39ca:0:dad3:3:1]:443 SOURCE=EXIT

It looks like trac.tpo is always resolved to an IPv6 address, and Tor switches through the exit nodes until one with IPv6 support is found.

#21310 and #21311 may fix this, but we might also need a fix on the client side.

comment:20 Changed 14 months ago by arthuredelstein

I did some more experiments:

First I tried https://torpat.ch. This does not have an IPv6 address, as far as I can tell. I got the following result:

https://torpat.ch:
00000000040000400022002100020100002100000102011100

where each digit represents one connection attempt, giving the number of timeouts before that connection succeeded. So to me it doesn't look like IPv6 is the (only) problem. 15/50 attempts included at least one timeout, and about half of those had more than one timeout.

So I thought I should check IPv6 as well. I found the IPv6 address for perdulce.torproject.org:

http://[2a01:4f8:172:1b46:0:abba:11:1]/ (perdulce.torproject.org ipv6): 
000000000000000000000000000000000000000000000000000000000000000

To my surprise, there were no timeouts at all. So I tried IPv4:

http://138.201.14.203/ (perdulce.torproject.org ipv4)
000000000000000000000000000000000000000000000000000000000000000

Again no timeouts. Then I tried a site on perdulce:

https://people.torproject.org/~arthuredelstein/ :
011000310000000111110000000010130100110122001010000

and the timeouts returned (20/50). That made me think this has something to do with the DNS resolve. To check this, I tried another site, example.com, including https with domain, http with domain, bare IPv4, and bare IPv6:

https://example.com
00000010000010000000020010010210100000000000000003

http://example.com
00100000021000002001001001000000001000000000001000

http://93.184.216.34/
00000000000000000000000000000000000000000000000000

http://[2606:2800:220:1:248:1893:25c8:1946]/
00000000000000000000000000000000000000000000000010

Indeed I got 9/50 timeouts for the domain with http or https, but no timeouts for IPv4 and only a single timeout for IPv6.

Does this ring any bells for Tor core experts? What might be happening with DNS here? Again I think the multiple timeouts are a little suspicious, and I don't quite understand how that jibes with it being a (pure) exit node problem.

comment:21 Changed 14 months ago by gk

Keywords: tbb-needs added

comment:22 in reply to:  20 ; Changed 14 months ago by teor

Replying to arthuredelstein:

I did some more experiments:

...
Indeed I got 9/50 timeouts for the domain with http or https, but no timeouts for IPv4 and only a single timeout for IPv6.

Does this ring any bells for Tor core experts? What might be happening with DNS here?

Some exits may be overloading their resolvers. Or our code may be buggy. It would be helpful to identify the particular exits that are experiencing these timeouts, and work out if they are in the same AS or using the same resolvers.

I also wonder if we should ask bandwidth authorities to use DNS whenever possible, so they see DNS timeouts, and downgrade exits that have them. See #24010.

Again I think the multiple timeouts are a little suspicious, and I don't quite understand how that jibes with it being a (pure) exit node problem.

The only node in a tor path that uses DNS is an exit, so if DNS breaks, it causes issues at the exit.

comment:23 in reply to:  22 ; Changed 14 months ago by arthuredelstein

Replying to teor:

Replying to arthuredelstein:

I did some more experiments:

...
Indeed I got 9/50 timeouts for the domain with http or https, but no timeouts for IPv4 and only a single timeout for IPv6.

Does this ring any bells for Tor core experts? What might be happening with DNS here?

Some exits may be overloading their resolvers. Or our code may be buggy. It would be helpful to identify the particular exits that are experiencing these timeouts, and work out if they are in the same AS or using the same resolvers.

Makes sense. If the DNS resolve fails at an exit, does the exit send an error message back to the client? Or does it silently fail, meaning the client has to wait for the full 10-second timeout?

I also wonder if we should ask bandwidth authorities to use DNS whenever possible, so they see DNS timeouts, and downgrade exits that have them. See #24010.

Nice idea. Would it also be feasible to have exits periodically run diagnostics to see if their DNS resolution is working properly, and if not, report the problem to bandwidth authorities and notify their relay operator?

The only node in a tor path that uses DNS is an exit, so if DNS breaks, it causes issues at the exit.

That seems sensible. I'm only a little puzzled that it seems more common than I would expect that I saw not a single timeout, but a double, triple or quadruple timeout (see instances of 2,3,4 in my raw data). Presumably it's switching to a new exit node after each individual timeout, so why do I frequently see multiple timeouts for a single connection? Maybe it's just bad luck, but it made me wonder if I'm seeing something that goes wrong for the whole connection attempt and not just individual circuits.

comment:24 in reply to:  23 ; Changed 14 months ago by teor

Milestone: Tor: 0.3.2.x-finalTor: 0.3.3.x-final

Replying to arthuredelstein:

Replying to teor:

Replying to arthuredelstein:

I did some more experiments:

...
Indeed I got 9/50 timeouts for the domain with http or https, but no timeouts for IPv4 and only a single timeout for IPv6.

Does this ring any bells for Tor core experts? What might be happening with DNS here?

Some exits may be overloading their resolvers. Or our code may be buggy. It would be helpful to identify the particular exits that are experiencing these timeouts, and work out if they are in the same AS or using the same resolvers.

Makes sense. If the DNS resolve fails at an exit, does the exit send an error message back to the client? Or does it silently fail, meaning the client has to wait for the full 10-second timeout?

It depends on how it fails.
If the resolve times out at the exit, it also times out at the client.
If the resolve fails fast, an error cell is sent to the client.
I don't think we can make this faster.

I also wonder if we should ask bandwidth authorities to use DNS whenever possible, so they see DNS timeouts, and downgrade exits that have them. See #24010.

Nice idea. Would it also be feasible to have exits periodically run diagnostics to see if their DNS resolution is working properly

Yes. Exits already check DNS at startup, and turn off exit traffic if it fails. I opened #24014 in 0.3.3 to make them check periodically.

and if not, report the problem to bandwidth authorities

There's no way for relays to report anything directly to the bandwidth authorities.
Instead, relays modify their descriptors in response to self-checks.
In this case, the relay would disable exit traffic until a DNS check succeeds, and clients would find out about it when they next download its (micro)descriptor after the next consensus.

and notify their relay operator?

Yes, this would be part of #24014: we will log a warning when we disable exit traffic.

The only node in a tor path that uses DNS is an exit, so if DNS breaks, it causes issues at the exit.

That seems sensible. I'm only a little puzzled that it seems more common than I would expect that I saw not a single timeout, but a double, triple or quadruple timeout (see instances of 2,3,4 in my raw data). Presumably it's switching to a new exit node after each individual timeout, so why do I frequently see multiple timeouts for a single connection? Maybe it's just bad luck, but it made me wonder if I'm seeing something that goes wrong for the whole connection attempt and not just individual circuits.

You could also have a slow guard, or a site that has slow DNS.

But the most likely explanation is that some exits are massively overloaded, and DNS bears the brunt of that overloading.
We could encourage relay operators to use a local DNS cache, but threads on this come up every month or two on tor-relays, so I'm not sure starting another would be useful.

Another task that's in progress is to shift exit bandwidth away from the US east coast and Western Europe, because there's an over-allocation in that area at the moment. (It is where most bandwidth authorities have their HTTPS servers.)

I would suggest that we find a way of monitoring this, so we can check if our fixes make a difference.
This might be a task for metrics, I'll leave it to you to open a ticket, because you know what needs to be done to test for timeouts.

There's nothing in this ticket that core tor can bugfix in 0.3.2, so I'm moving it to 0.3.3.

comment:25 in reply to:  24 Changed 14 months ago by arthuredelstein

Replying to teor:

It depends on how it fails.
If the resolve times out at the exit, it also times out at the client.
If the resolve fails fast, an error cell is sent to the client.
I don't think we can make this faster.

OK, good to know a fast fail is reported back through the circuit. I only wonder if it's possible there could be a bug in sending or receiving that error that makes it fail sometimes.

Yes. Exits already check DNS at startup, and turn off exit traffic if it fails. I opened #24014 in 0.3.3 to make them check periodically.

Excellent, thanks.

You could also have a slow guard, or a site that has slow DNS.

I believe I've seen the same failure rate on a couple of different guards but I should investigate that further.

I would suggest that we find a way of monitoring this, so we can check if our fixes make a difference.
This might be a task for metrics, I'll leave it to you to open a ticket, because you know what needs to be done to test for timeouts.

Good idea -- I opened #24018 for that.

comment:26 Changed 14 months ago by arthuredelstein

Cc: arthuredelstein added

comment:27 Changed 14 months ago by arthuredelstein

I wrote a script using atagar's very nice stem library to try to test DNS behavior across all exits:
https://github.com/arthuredelstein/tor_dns_survey/tree/04176dbf9af1ec4d8f4753f504dcc6652bd257c7

The script creates 2-hop circuits. Each exit was tested 5 times for example.com and 5 times for '93.184.216.34', the bare IPv4 for example.com.

Here's the raw-ish data:
https://arthuredelstein.net/tor/21394/exit_results_20171029.json
The format is

{ address : { exit : [ [ attempt1_result, attempt1_date_time, attempt1_duration ], [ attempt2... ] ... ] } }

with about 5 attempts made per address/exit combination.
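
For reference, here is a sketch of summarizing that JSON into the per-exit categories used below. The result value "timeout" is an assumption standing in for whatever the script actually records:

import json

# Summarize the survey data: for each address, count exits that always,
# sometimes, or never timed out. "timeout" as the result value is an
# assumption; substitute the script's actual encoding.
with open("exit_results_20171029.json") as f:
    data = json.load(f)

for address, exits in data.items():
    always = sometimes = never = 0
    for fingerprint, attempts in exits.items():
        n = sum(1 for result, when, duration in attempts if result == "timeout")
        if n == len(attempts):
            always += 1
        elif n > 0:
            sometimes += 1
        else:
            never += 1
    print(address, "always:", always, "sometimes:", sometimes, "never:", never)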

I did a quick analysis of the data and found the following.

  • For the raw IP address, the overall timeout frequency was 1.39%. 3 exits always timed out, 29 exits sometimes did, and 752 always succeeded (the rest were assorted fast failures).
  • For example.com, the overall timeout frequency was 9.03%: 41 exits always timed out, 74 sometimes did, and 653 always succeeded (again the remainder were other failures).

It's interesting that the observed timeout frequency here was only half that observed in comment:20. Perhaps that's because here I tested all exits equally, instead of using consensus weights to choose exits as Tor Browser does. But I would have expected more timeouts, not fewer, if I used the low-bandwidth exits more, so I'm surprised by that.

In any case, if the timeout probability were uniformly 9% for all exits, I would expect essentially no cases of an exit failing 5 out of 5 times. So it's clear that some exits have a much higher failure probability than others. Therefore we have "bad" and "flaky" exits.
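
The "essentially no cases" expectation is easy to verify (a sketch using the figures above; the ~800-exit total is approximate):

# If every exit had a uniform 9% timeout probability, how many exits
# would we expect to fail all 5 attempts? (~800 exits is approximate.)
p, attempts, exits = 0.09, 5, 800
expected = exits * p ** attempts
print(f"{expected:.3f} expected always-timeout exits")  # ~0.005, vs 41 observed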

Last edited 14 months ago by arthuredelstein (previous) (diff)

comment:28 Changed 14 months ago by arthuredelstein

Here's a table of the same data as comment:27. The second column, timeout_rate_21394, shows the fraction of connections for each exit that timed out (after 10 seconds). I also included onionoo relay metadata so we can look for any patterns:

https://docs.google.com/spreadsheets/d/16-SCjR7dTjEN_O3MOX4ZhlR47tvJYONmXICNJkGYnvw/edit?usp=sharing

One thing I noticed is that bad relays have higher than average consensus weights. Here are the average consensus weights for relays with different DNS behaviors:

  • Bad (always timed out): 38612
  • Flaky (sometimes timed out): 10440
  • Good (never timed out): 7042

That trend explains why I see higher timeout frequency with Tor Browser than with my DNS timeout survey script.
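
As a rough illustration of the size of that effect, one can combine the group counts from comment:27 with the average weights above, under the simplifying assumption that every exit in a group carries its group's average weight:

# How often does a weight-proportional client (like Tor Browser) land on
# a "bad" exit versus a uniform survey? Counts are the example.com groups
# from comment:27; weights are the group averages above. Assigning each
# exit its group's average weight is a simplification.
groups = {"bad": (41, 38612), "flaky": (74, 10440), "good": (653, 7042)}
total_exits = sum(n for n, w in groups.values())
total_weight = sum(n * w for n, w in groups.values())
for name, (n, w) in groups.items():
    print(f"{name}: uniform {n / total_exits:.1%}, "
          f"weighted {n * w / total_weight:.1%}")
# bad exits: ~5.3% of uniform picks but ~22.8% of weight-proportional picks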

comment:29 Changed 14 months ago by teor

Replying to arthuredelstein:

...
One thing I noticed is that bad relays have higher than average consensus weights. Here are the average consensus weights for relays with different DNS behaviors:

  • Bad (always timed out): 38612
  • Flaky (sometimes timed out): 10440
  • Good (never timed out): 7042

That trend explains why I see higher timeout frequency with Tor Browser than with my DNS timeout survey script.

This isn't surprising to me, but it's good to have some data.

I've noticed that guard relays I run in particular locations (east coast US, west coast EU) are overloaded, and others have told me that their exits in similar locations are also overloaded.

If you want to address this issue, we have to either:

  • penalise timeouts more on bandwidth authorities (we need a test environment to make changes like this), or
  • distribute bandwidth authority clients and servers outside EU/US (we're working on this)

We could also find and map overloaded relays, or relays with higher than average consensus weight to bandwidth ratios. That would help us find out if our changes are making a difference. I opened #24045 for this.

comment:30 Changed 14 months ago by Sebastian

I like the first suggestion much more than the second, as I still haven't seen a convincing model where our median voting for BW will actually produce a more balanced network.

comment:31 Changed 14 months ago by arthuredelstein

I ran another test, this time with a CircuitStreamTimeout of 100 (seconds):

Overall timeout counts:

  • Always timed out: 15
  • Sometimes timed out: 61
  • Never timed out: 706

The number of stream timeout events was substantially reduced relative to comment:27. It's not a perfect comparison, because this was done on a different day and the total test took substantially longer. But it does seem to suggest that DNS services for some of the exits are very slow, but not disabled. Next I can look at the distribution of DNS response times. Then I plan to contact some exit relay operators with poor DNS response and see if they can investigate why their DNS is slow to respond or broken.

Direct IP timeouts were again much lower:

  • Always timed out: 2
  • Sometimes timed out: 18
  • Never timed out: 765
Last edited 14 months ago by arthuredelstein (previous) (diff)

comment:32 Changed 14 months ago by arthuredelstein

I looked more closely at timeouts for example.com:

  • 100-second timeouts: 7.13%
  • 10 seconds -> 100 seconds successes: 5.21%
  • Total responses later than 10 seconds: 12.34%

That's actually worse than the 9.03% seen in comment:27. (BTW, it's important to note this is again not weighted by consensus weights but treats all exits equally.)

For bare IP address:

  • 100-second timeouts: 1.00%
  • 10 seconds -> 100 seconds successes: 2.45%
  • Total responses later than 10 seconds: 3.45%.

Again, this is worse than comment:27, but the discrepancy shows the importance of DNS to delayed responses.

Last edited 14 months ago by arthuredelstein (previous) (diff)

comment:33 Changed 14 months ago by Sebastian

#18580 could be very relevant here?

comment:34 Changed 13 months ago by Sebastian

So yes, I think that bug report is very relevant here. I think this is a libevent bug, or an unfortunate issue with how libevent and unbound interact. By default, libevent only allows 64 DNS requests to be in flight at the same time. When unbound is asked to resolve something and the DNS server in question nullroutes traffic from unbound, it takes longer than named to reply with a SERVFAIL (named's timeout matches what Tor uses). So if a bigger DNS server operator decides it has gotten too many queries from a relay, reaching that limit of 64 in-flight queries happens in absolutely no time, and all other DNS requests get queued.

There may be an additional issue where we're marking our local unbound as down because it didn't reply quickly enough to a query we made, because it's still trying to answer it. Combined, these would easily explain why DNS gets wedged completely.

We REALLY need a high-volume exit operator to help us debug these assumptions.
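
A back-of-the-envelope sketch of how quickly that 64-slot limit wedges. The ~10-second slot occupancy is an assumption based on tor's pre-patch single-resolver timeout (visible in the diff in comment:45); retries would make it worse:

# Little's-law estimate: how many blackholed queries per second keep all
# of libevent's default 64 in-flight slots busy? Assumes each dead query
# holds a slot for ~10 s (tor's old single-resolver timeout); retries
# would lengthen this and wedge the queue even faster.
slots = 64
occupancy_s = 10
print(f"~{slots / occupancy_s:.1f} blackholed queries/s fill all {slots} slots")
# A busy exit easily exceeds ~6 dead queries/s once a resolver starts
# dropping its traffic, so all other DNS requests queue behind them.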

comment:35 Changed 13 months ago by tom

Cc: tom added

comment:36 Changed 13 months ago by arthuredelstein

gamambel and pastly are helping us with experiments. Following Sebastian's suggestion (derived from Dhalgren's work in #18580), they added

options timeout:5 attempts:2 max-inflight:8192 max-timeouts:1000000

to their resolv.conf, and restarted their exits.

gamambel: HaveHeart, CriticalMass, sofia
pastly: StanMarsh, KyleBroflovski

I ran my test script again on their exits plus a few more, and here are the results:
https://arthuredelstein.net/tor/21394/exit_results_20171101_2311.json
https://docs.google.com/spreadsheets/d/1yRwTo0OhOCOb48CYFU1eGceX3Iydonzu5uGeia1c6aE/edit?usp=sharing

Unfortunately gamambel's relays seem to have gone down entirely so they're missing from these results -- not sure if the reason is related to the new settings. I will run the test again when they're back up.

I think it's at least promising that DNS is working on pastly's exits. It will be especially convincing that we have found the cause if we revert the resolv.conf settings and DNS stops working again.

Last edited 13 months ago by arthuredelstein (previous) (diff)

comment:37 Changed 13 months ago by arthuredelstein

I ran another round of tests. (I set the timeout back to 10 seconds, because that's a more realistic comparison of what will be seen in Tor Browser.) Results:
https://arthuredelstein.net/tor/21394/exit_results_20171102_0602.json
https://docs.google.com/spreadsheets/d/1TjueUpGvsdXPf22tSZAAQ862e_MPlse8NCGIVD-MAYo/edit?usp=sharing

gamambel's exit relays are back up! And interestingly, two of them (HaveHeart and sofia) are showing bad DNS, and one of them (CriticalMass) is showing good DNS. Both of pastly's exits are also showing good DNS, as they did in the previous comment.

Last edited 13 months ago by arthuredelstein (previous) (diff)

comment:38 Changed 13 months ago by arthuredelstein

11:06 <+gamambel> arthuredelstein: after the reboot yesterday, resolvconf reset the resolv.conf, so it did not contain the patched lines
11:06 <+gamambel> i guess i should add them somewhere in the resolvconf configuration instead of editing that file manually
11:06 <+gamambel> all the 3 exits run on the same machine using the same resolv.conf
11:10 <+gamambel> i added the option back now, so you can run the tests again

OK, I have just run the tests again. Results:
https://arthuredelstein.net/tor/21394/exit_results_20171102_1548.json
https://docs.google.com/spreadsheets/d/1rxXJnXCFP4CgUtNAC7L8x4wyDZMOyLq3zz5Aqhui6eE/edit?usp=sharing

All 5 exits (HaveHeart, CriticalMass, sofia, StanMarsh, KyleBroflovski) had no timeouts! That's quite promising. I'm going to keep running the test today to see if that good behavior persists. And maybe after that we can revert resolv.conf and see if the timeouts reappear.

comment:39 Changed 13 months ago by arthuredelstein

I ran three more rounds of tests today, with hours in between each. Results:
https://arthuredelstein.net/tor/21394/exit_results_20171102_2209.json
https://docs.google.com/spreadsheets/d/1p9HLXl-pMPuXOWsr5ZfpiIAAvs1QwljcksnuI8PDtEQ/edit?usp=sharing

In short, our 5 exits showed absolutely no timeouts. In the three rounds in this comment and one round in the previous comment, we had 5 tests per exit connecting to example.com. So that's a total of 100 tests with no timeouts.

Next, I would like to try reverting resolv.conf for these exits and see if DNS starts causing timeouts again.

Last edited 13 months ago by arthuredelstein (previous) (diff)

comment:40 Changed 13 months ago by arthuredelstein

gamambel has reverted resolv.conf to the old settings, and pastly has kept the new settings for now. I ran another test:

https://arthuredelstein.net/tor/21394/exit_results_20171102_2348.json
https://docs.google.com/spreadsheets/d/1CdEObqAKqjobNf0pO-_rr8SLxBlMVW-Rdm7VwOHK0zA/edit?usp=sharing

We see that two of gamambel's exits (HaveHeart and sofia) show significant timeouts, while CriticalMass (as we saw before in comment:37) does not. pastly's exits (new settings) again show no timeouts.

Last edited 13 months ago by arthuredelstein (previous) (diff)

comment:41 Changed 13 months ago by arthuredelstein

I ran another test right after gamambel did a full reboot of his machine, keeping the old settings:
https://arthuredelstein.net/tor/21394/exit_results_20171103_0014.json
https://docs.google.com/spreadsheets/d/1AY0jfbBXaKwDNQ7odiMNctAke9A-mLKAhVTaUgg-eDk/edit?usp=sharing

exit timeouts/connections:
HaveHeart 2/4
sofia 1/5
CriticalMass 0/5

So the timeouts start very soon after startup.

comment:42 Changed 13 months ago by Dhalgren

Allow me to assert my original tuning recommendation is well considered and advisable:

resolv.conf -- final recommendation

options timeout:5 attempts:1 max-inflight:16384 max-timeouts:1000000
nameserver 127.0.0.1

https://unbound.net/pipermail/unbound-users/2016-April/004301.html

Where timeout:5 is the usual value appropriate for a Tor daemon (Tor
clients shift to another relay and retry on DNS failures). Attempts:1
assumes that the resolver is a local Unbound instance where Unbound
will handle all timeout retry processing and no UDP loss is possible
between the 'tor' process and the local Unbound, so it's best to give
up directly after five seconds. Max-inflight:4096 {revised to 16384}
both mitigates the DOS scenario experienced and maximizes DNS
performance of the exit relay. Max-timeouts:100 should prevent
eventdns from marking the dedicated local resolver as "down" unless
it really is down. Perhaps max-timeouts:1000000 is better in order
to completely inhibit the timed-out "down resolver" logic.

but per

https://trac.torproject.org/projects/tor/ticket/18580#comment:14

where it turns out that max-timeouts is capped at 255 by eventdns.c.
Will create a patch to remove the 255 limit on next Tor daemon update. . .

Would include the patch but I appear to have overlooked writing it.

comment:43 Changed 13 months ago by Dhalgren

Cc: dhalgren.tor@… added

comment:45 Changed 13 months ago by arthuredelstein

Sebastian created an experimental tor patch (note: not the final patch for this ticket) to apply settings based on Dhalgren's:

index 078bde3ef..08ed219d2 100644
--- a/src/or/dns.c
+++ b/src/or/dns.c
@@ -1439,13 +1439,19 @@ configure_nameservers(int force)
 #define SET(k,v)  evdns_base_set_option(the_evdns_base, (k), (v))
 
   if (evdns_base_count_nameservers(the_evdns_base) == 1) {
-    SET("max-timeouts:", "16");
-    SET("timeout:", "10");
+    SET("max-timeouts:", "255");
+    SET("timeout:", "5");
   } else {
     SET("max-timeouts:", "3");
     SET("timeout:", "5");
   }
 
+  // Elongate the queue of maximum inflight dns requests, so if a bunch
+  // time out at the resolver (happens commonly with unbound) we won't
+  // stall every other DNS request
+  SET("max-inflight:", "8192");
+  SET("attempts:", "1");
+
   if (options->ServerDNSRandomizeCase)
     SET("randomize-case:", "1");
   else

gamambel applied this patch and ran tor. I ran my test and found no timeouts:
https://arthuredelstein.net/tor/21394/exit_results_20171103_0137.json
https://docs.google.com/spreadsheets/d/1ZORq3WlV8-VaTiqoQ4bn3dEN3wkb0z6HkOcrkrG8z7A/edit?usp=sharing

Last edited 13 months ago by arthuredelstein (previous) (diff)

comment:46 Changed 13 months ago by Sebastian

Status: newneeds_review

Branch bug21394 in my repo

comment:47 Changed 13 months ago by Sebastian

Note that the patch *is not* the diff above.

comment:48 Changed 13 months ago by arma

Milestone: Tor: 0.3.3.x-finalTor: 0.3.2.x-final

I like the branch, am happy it got some testing from moritz, and look forward to seeing it in our upcoming alpha.

comment:49 Changed 13 months ago by arma

Status: needs_reviewmerge_ready

switching to merge-ready on the theory that nickm is going to glance over it no matter which state the ticket is in :)

Last edited 13 months ago by arma (previous) (diff)

comment:51 Changed 13 months ago by arthuredelstein

comment:52 Changed 13 months ago by arma

Sounds great.

I think the next steps are:

  • Get it into the upcoming alpha
  • Tell tor-relays that they can either switch to the new alpha, or apply the workaround to their resolv.conf manually, or sit tight to await the results of the broader deployment.

comment:53 in reply to:  42 ; Changed 13 months ago by Dhalgren

Replying to Dhalgren:

Allow me to assert my original tuning recommendation is well considered and advisable:

resolv.conf -- final recommendation

options timeout:5 attempts:1 max-inflight:16384 max-timeouts:1000000
nameserver 127.0.0.1

At the time I came up with the above, my assumption was that DNS timeouts are propagated back to clients, but I see per comment:24

Makes sense. If the DNS resolve fails at an exit, does the exit send an error message back to the client? Or does it silently fail, meaning the client has to wait for the full 10-second timeout?

It depends on how it fails.
If the resolve times out at the exit, it also times out at the client.
If the resolve fails fast, an error cell is sent to the client.

that in this case nothing is sent and the client times out at 10 seconds independently; therefore for Unbound-specific configurations it should be

options timeout:10 attempts:1 max-inflight:16384 max-timeouts:1000000
nameserver 127.0.0.1

but concede that in the general case where the resolver is either named or Unbound, that

options timeout:5 attempts:2 . . .

is a better choice

Last edited 13 months ago by Dhalgren (previous) (diff)

comment:54 in reply to:  53 ; Changed 13 months ago by arma

Replying to Dhalgren:

that in this case nothing is sent and the client times-out at 10 seconds independently

For added fun, the client starts its 10 second timer when it sends the begin cell, so if it takes x seconds for the begin cell to make it to the exit, and y seconds for a response (failure *or* success) to make it back to the client, then the exit needs to do its thing in 10-x-y seconds or it won't happen in time. Typical values of x and y are 0.5 seconds. So, approximating things as 10 seconds is not crazy, but don't plan to be doing anything new at the 8 or 9 second mark and expect it to have as much impact.

comment:55 in reply to:  54 ; Changed 13 months ago by arma

Replying to arma:

the client starts its 10 second timer when it sends the begin cell

I should also point out that the client's *first* stream attempt is 10 seconds, but subsequent attempts (once it's given up on the first one and tried a new circuit) are 15 seconds each.

comment:56 in reply to:  55 Changed 13 months ago by Dhalgren

Replying to arma:

Replying to arma:

the client starts its 10 second timer when it sends the begin cell

I should also point out that the client's *first* stream attempt is 10 seconds, but subsequent attempts (once it's given up on the first one and tried a new circuit) are 15 seconds each.

Interesting, thank you for the detail. The number of successful resolves arriving between 5 and 15 seconds is on the order of 0.45%, and the difference between discarding them and forwarding them is not severe enough to cause me regret over configuring this for 18 months. Here is a recent set of Unbound statistics to illustrate. The data seems unlikely to be of actionable use to adversaries, and hopefully it's not improper to share publicly:

unbound: [:0] info: server stats for thread 0: 1748646 queries, 368913 answers from cache, 1379733 recursions, 0 prefetch
unbound: [:0] info: server stats for thread 0: requestlist max 331 avg 134.014 exceeded 0 jostled 0
unbound: [:0] info: average recursion processing time 6.735093 sec
unbound: [:0] info: histogram of recursion processing times
unbound: [:0] info: [25%]=0.0130326 median[50%]=0.0450708 [75%]=0.127054
unbound: [:0] info: lower(secs) upper(secs) recursions
unbound: [:0] info:    0.000000    0.000001 39515
unbound: [:0] info:    0.000032    0.000064 1
unbound: [:0] info:    0.000128    0.000256 1
unbound: [:0] info:    0.000256    0.000512 23408
unbound: [:0] info:    0.000512    0.001024 64670
unbound: [:0] info:    0.001024    0.002048 33392
unbound: [:0] info:    0.002048    0.004096 19250
unbound: [:0] info:    0.004096    0.008192 47427
unbound: [:0] info:    0.008192    0.016384 198450
unbound: [:0] info:    0.016384    0.032768 159053
unbound: [:0] info:    0.032768    0.065536 278833
unbound: [:0] info:    0.065536    0.131072 181936
unbound: [:0] info:    0.131072    0.262144 207508
unbound: [:0] info:    0.262144    0.524288 45953
unbound: [:0] info:    0.524288    1.000000 14731
unbound: [:0] info:    1.000000    2.000000 7530
unbound: [:0] info:    2.000000    4.000000 8677
unbound: [:0] info:    4.000000    8.000000 2740
unbound: [:0] info:    8.000000   16.000000 3461
unbound: [:0] info:   16.000000   32.000000 11938
unbound: [:0] info:   32.000000   64.000000 13751
unbound: [:0] info:   64.000000  128.000000 4671
unbound: [:0] info:  128.000000  256.000000 3268
unbound: [:0] info:  256.000000  512.000000 6127
unbound: [:0] info:  512.000000 1024.000000 2086
unbound: [:0] info: 1024.000000 2048.000000 626
unbound: [:0] info: 2048.000000 4096.000000 480
unbound: [:0] info: 4096.000000 8192.000000 200
unbound: [:0] info: 8192.000000 16384.000000 20
unbound: [:0] info: 16384.000000 32768.000000 6
unbound: [:0] info: 32768.000000 65536.000000 1

Unbound displays impressive patience w/r/t completing queries. I arrived at 0.46% by taking the 4-16 second resolves as a percentage of all successful resolves completing in under 16 seconds; it is 0.36% relative to queries resolved in under 16 seconds plus answers from cache. 2.47% of all queries take over 16 seconds to resolve and are lost.
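
Those figures can be reproduced from the histogram (a quick sketch; the 4-8 s and 8-16 s buckets approximate the 5-15 second window of interest):

# Reproduce the percentages above from the posted unbound statistics.
queries = 1748646      # total queries (= recursions + answers from cache)
recursions = 1379733
cache = 368913
slow = 2740 + 3461     # successful resolves in the 4-8 s and 8-16 s buckets
over16 = (11938 + 13751 + 4671 + 3268 + 6127 + 2086
          + 626 + 480 + 200 + 20 + 6 + 1)
under16 = recursions - over16   # recursions completing within 16 s
print(f"{slow / under16:.2%} of resolves under 16 s")         # ~0.46%
print(f"{slow / (under16 + cache):.2%} incl. cache answers")  # ~0.36%
print(f"{over16 / queries:.2%} of all queries are lost")      # ~2.47%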

Last edited 13 months ago by Dhalgren (previous) (diff)

comment:57 Changed 13 months ago by Dhalgren

Just noticed the handy Unbound statistic "requestlist max X avg Y".

Going back I see typically the maximum pending DNS query list runs around 600 with an average length of around 300. Max has been steady but average tripled over 12 months to the present value. Occasional extreme case is 3000 for a day, sometimes two.

comment:58 Changed 13 months ago by Dhalgren

Reviewed the prospective patch and see issues:

1) it appears these settings override ones present in resolv.conf; if options are set in resolv.conf, those should take precedence

2) recommend incorporating recent commentary from above: when one resolver is present and there is no explicit setting in resolv.conf, set attempts:3 in light of comment:55, or at a minimum attempts:2, due to the reality that named might be present and attempts:1 will eliminate retries in that scenario

comment:59 in reply to:  58 Changed 13 months ago by Dhalgren

Replying to Dhalgren:

Or simply eliminate overrides for attempts: and timeout: as the defaults are acceptable in both the named and Unbound cases. Key elements here are max-inflight: and max-timeouts:.

comment:60 Changed 13 months ago by nickm

I'm fine with merging Sebastian's patch, though I see there is more discussion following.

Dhalgren, how would it be if we took Sebastian's patch in 0.3.2 now, and then discussed further fixes as changes on top of that?

As an alternative, I'd also look at another patch, but I'm hoping to get this fix tested in another 0.3.2 alpha to be released early this week.

comment:61 Changed 13 months ago by Sebastian

I'm not sure that retrying is a good idea in general. Clients can just try a different exit node. This should be a rare occurrence anyway. As for overwriting resolv.conf values - yes, that happens, but libevent doesn't expose an API to do it a different way afaict. We do it already for some stuff. Maybe an acceptable alternative is to allow overwriting these values via torrc parameters?

comment:62 in reply to:  60 Changed 13 months ago by Dhalgren

Replying to nickm:

Dhalgren, how would it be if we took Sebastian's patch in 0.3.2 now, and then discussed further fixes as changes on top of that?

Yes, that's fine.

Replying to Sebastian:

Maybe an acceptable alternative is to allow overwriting these values via torrc parameters?

Sure.

comment:63 Changed 13 months ago by nickm

Keywords: 029-backport 030-backport 031-backport added

Backported Sebastian's branch to 0.2.9 as bug21394_029. Now merging it to 0.3.2 and forward. If it works out, let's backport some more.

comment:64 Changed 13 months ago by nickm

Milestone: Tor: 0.3.2.x-finalTor: 0.3.1.x-final

(Marking for 0.3.1 backport. Could we please have another ticket for discussion of further improvements here?)

comment:65 in reply to:  61 ; Changed 13 months ago by Dhalgren

Keywords: 029-backport 030-backport 031-backport removed
Milestone: Tor: 0.3.1.x-finalTor: 0.3.2.x-final

Replying to Sebastian:

. . . Clients can just try a different exit node

retry here is in regard to DNS query retries within the 10 second tor client timeout window; idea is to have named retry DNS at five seconds -- Unbound performs these automatically but named relies on the requester doing so

comment:66 in reply to:  64 Changed 13 months ago by Dhalgren

Replying to nickm:

Could we please have another ticket for discussion of further improvements here?)

I'm for continuing on #18580 since the background research resides there

comment:67 in reply to:  65 Changed 13 months ago by teor

Keywords: 029-backport 030-backport 031-backport added
Milestone: Tor: 0.3.2.x-finalTor: 0.3.1.x-final

Replying to Dhalgren:

Replying to Sebastian:

. . . Clients can just try a different exit node

retry here is in regard to DNS query retries within the 10 second tor client timeout window; idea is to have named retry DNS at five seconds -- Unbound performs these automatically but named relies on the requester doing so

Restoring keywords and release that were reset in this comment.

comment:68 Changed 13 months ago by arthuredelstein

Thanks to Dhalgren, Sebastian, arma, gamambel and pastly for your work on this!

comment:69 Changed 13 months ago by arthuredelstein

I have a script running every 24 hours to check exits for DNS timeouts. I'll be posting the results at https://arthuredelstein.net/exits

comment:70 Changed 13 months ago by tom

https://ritter.vg/misc/stuff/tor-bwauth-dns-comparison-01.png

I ran two bwauths that had no differences between them, and calculated the average difference between two sets of relays for each consensus: the set of all relays, and the set of relays where the bw measurement was above 100 in both bwauths.

That data (the blue and green lines) is the control.

Then I changed one bwauth to connect to the final destination using an IP address and left the other one connecting using a DNS name. That data is the orange and red lines.

I don't see much of a difference. So it seems that bwauth measurements were not affected by this issue.

comment:71 Changed 13 months ago by nickm

Backported this to 0.2.9 and forward.

comment:72 Changed 13 months ago by nickm

I'd like to close this as fixed; can we un-parent the children?

comment:73 Changed 13 months ago by teor

Resolution: fixed
Status: merge_readyclosed

comment:74 Changed 13 months ago by Dhalgren

Resolution: fixed
Status: closedreopened

I object.

The solution is not complete per comment:58. I am willing to work on it if the devs are willing to bring any issues they have to me rather than hacking on my patch, as past experience tends to indicate.

comment:75 in reply to:  74 Changed 13 months ago by teor

Replying to Dhalgren:

I object.

Solution is not complete per comment:58.

Hi Dhalgren,

We try to make backported fixes small and obviously correct.
Is there a critical issue that stops this patch working on most machines?
If so, please continue on this ticket.

If not, please open another ticket targeting 0.3.3 for further improvements.

I am willing to work on it if the devs are willing to bring any issues the have to me rather than hacking on my patch as past experience tends to indicate.

I'm sorry, I'm not sure how we failed here, but it's clear that we made some mistakes.

Here's how our process usually works:
We discuss patches on the bug tracker and #tor-dev.
For backported emergency fixes, whoever is available does the work, and we try to get it done as soon as possible.
Sometimes this means that things get forgotten or people get left out. I'm sorry about that.

If you want to tune the parameters, please open another ticket, and we will have more time to consider it.

comment:76 Changed 13 months ago by Dhalgren

A correct patch will be only slightly larger than the existing patch. Adding tuneable parameters is trivial, requires one line for each in or.h and in config.c, and one expression to obtain the value.

The current patch discards successful DNS replies that arrive between 5 and 15 seconds, and while the percentage of queries affected is small, the behavior is incorrect. If an operator configures a single named resolver, no retries whatsoever are attempted. 0.2.9 will be around for two more years, and a minor incremental effort towards correctness seems in order.

My experience with open source is core developers generally feel compelled to change everything contributed to "make it theirs," without involving the contributor; has left me nonplussed.

comment:77 in reply to:  76 Changed 13 months ago by teor

Replying to Dhalgren:

A correct patch will be only slightly larger than the existing patch. Adding tuneable parameters is trivial, requires one line for each in or.h and in config.c, and one expression to obtain the value.

The bugfix patch in this ticket has been backported and merged, so any new patch has to be built on top of it.

We try to avoid tuneable parameters, because we have to test each option that we put in a torrc. Also, tuneable parameters are a new feature, so they won't get backported.

The current patch discards successful DNS replies that arrive between 5 and 15 seconds and while the percentage of queries affected is small, the behavior is incorrect. If an operator configures a single named resolver, no retries whatsoever are attempted. 0.2.9 will be around for two more years and a minor incremental effort towards correctness seems in order.

Ok, thank you for describing the issue.

Please submit any bugfix patches based on maint-0.2.9, and we'll test it in master, then consider backporting them.

Please submit any feature patches that add new options based on master.

And please use separate tickets for feature and bugfix patches. Otherwise, people are more likely to get confused, and they won't see all of your work.

My experience with open source is core developers generally feel compelled to change everything contributed to "make it theirs," without involving the contributor; has left me nonplussed.

I'm sorry that this happened.
I know I do it myself sometimes, but I try not to.

I don't know exactly what happened in this case, and it's hard to tell by reading the ticket history. Please also understand that sometimes we change things so they are more maintainable.

comment:78 Changed 13 months ago by Dhalgren

I will submit a minimal revision this coming weekend to correct the second-paragraph issue, and open an enhancement ticket for tuneable support then as well. The change is simply to have the one-resolver case set three attempts with a five-second timeout, or perhaps five attempts with a three-second timeout, and I don't feel it has to be completed by me if it is expedient for someone else to do so. Requires changing two characters, literally.

Changed 12 months ago by Dhalgren

Attachment: bug21394_touch_up.patch added

recommended touch-up patch

comment:79 Changed 12 months ago by Dhalgren

I uploaded my suggested revision. Targets maint-0.2.9 as requested.

comment:80 Changed 12 months ago by arthuredelstein

Status: reopenedneeds_review

Setting to needs review for Dhalgren's touch-up patch in the previous comment.

comment:81 Changed 12 months ago by Sebastian

I am not sure that retrying in the case of named is actually beneficial, which is why I didn't include it in the patch. Even if we succeed on the later retry it's unlikely that the client will actually benefit from that, because it will also give up on the request.

I had no intention to slight you in any way. I felt that I pushed very hard to actually get your work here recognized, and a proper fix merged, after your contributions had unfortunately been ignored for a very long time. I am just a volunteer myself, and am spending this time because I care about having a good outcome for the network.

I feel that it is unfair to suggest that anything I did on this bug I did to make this "mine" or to receive undue credit. I made sure the release notes mention your work and have tried to make everyone aware of who did the analysis. We happen to disagree on a technical point, but that should not be conflated with social issues.

comment:82 in reply to:  81 Changed 12 months ago by Dhalgren

Replying to Sebastian:

I am not sure that retrying in the case of named is actually beneficial, which is why I didn't include it in the patch. Even if we succeed on the later retry it's unlikely that the client will actually benefit from that, because it will also give up on the request.

I'm not following the line of thought. If evdns.c (in libevent) re-attempts the request at five seconds and succeeds, the client knows nothing about it. All the client knows is that it required six-seven-whatever seconds for the circuit connection to complete. In the retried-circuit scenario, it could take twelve, thirteen, on the outside fourteen seconds for success.

Admittedly this will rarely matter (my Unbound histograms say about 0.5% of requests), but if attempts=1 timeout=5 is set, then the exit will literally throw away responses arriving after five seconds and the client will twiddle its thumbs for the remaining five or ten seconds of its timeout interval with no chance of success. My understanding is that evdns all-retries-exhausted events are _not_ relayed back to the client.

I had no intention to slight you in any way, I felt that I pushed very hard to actually get your work here recognized and a proper fix merged after your contributions had unfortunately been ignored for a very long time. I am just a volunteer myself and spending this time because I care about having a good outcome for the network.

Very sorry for any misunderstanding! I was not referring to you or anyone on this ticket above. I was expressing my trepidation at spending time contributing in general because several times in the past I submitted carefully thought-out code changes (to Tor and other projects) that were summarily hacked on for no obvious reason, and without involving me. Took much of the enjoyment out of it.

In regard to this issue I am rather liking the experience. I spent about six weeks understanding the problem, working on the fix, following up the results and documenting what I found on the wiki, and it's a pleasure to have the work recognized and for it to effect a critical improvement to the network--though my original motive was simply to tame a single berserk relay.

Last edited 12 months ago by Dhalgren (previous) (diff)

comment:83 Changed 10 months ago by nickm

Keywords: 030-backport removed

Remove 030-backport from all open tickets that have it: 0.3.0 is now deprecated.

comment:84 in reply to:  83 Changed 10 months ago by cypherpunks

Keywords: 030-backport added
Resolution: fixed
Status: needs_reviewclosed

Replying to nickm:

Remove 030-backport from all open tickets that have it: 0.3.0 is now deprecated.

Except those that should not have been reopened.

comment:85 Changed 10 months ago by teor

Keywords: 030-backport removed
Milestone: Tor: 0.3.1.x-finalTor: 0.3.3.x-final
Resolution: fixed
Status: closedreopened

There are still some changes here in this patch file:
https://trac.torproject.org/projects/tor/attachment/ticket/21394/bug21394_touch_up.patch

They still need review, but they won't be backported to 0.3.0.

Sorry, they didn't get reviewed, because they were assigned to the wrong release.

comment:86 Changed 9 months ago by nickm

Status: reopenedneeds_review

comment:87 Changed 9 months ago by nickm

Keywords: 033-triage-20180320 added

Marking all tickets reached by current round of 033 triage.

comment:88 Changed 9 months ago by nickm

Keywords: 033-included-20180320 added

Mark needs_review tickets as included for 033.

We can revisit this if they turn out to need much more review and/or revision than expected.

comment:89 Changed 9 months ago by dgoulet

Reviewer: nickm

Assign Reviewer to nickm for the 26/03/18 week.

comment:90 Changed 9 months ago by mikeperry

Reviewer: nickmmikeperry

comment:91 Changed 9 months ago by mikeperry

Status: needs_reviewneeds_information

Ok this is a long ticket (with great analysis by all!). I want to clarify a few things here to make sure I understand the situation, since this was closed and reopened and there was already some confusion about target versions.

Let me know if I get any of this wrong:

  1. https://trac.torproject.org/projects/tor/attachment/ticket/21394/bug21394_touch_up.patch is the only thing under consideration for merge, and we're targeting it on 0.3.3. (sha256: f81fbbbdae31dcf643ab87feb70386d2209d4c73bc38479d7a4c798a0d51115b since attachments can get changed).
  2. That patch sets the "attempts" parameter to 3, since a timeout after 5 seconds can be retried once more when the connection timeout is 10 seconds, and twice more when the connection timeout is 15 seconds (on retries), totaling 3 attempts in the 15-second case (see the sketch below).
  3. We're also raising max-timeouts in the multiple resolver case to be more patient with resolvers there, too.

All of this seems sane to me.
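
A sketch of the retry arithmetic in point 2 (the 10-second and 15-second client windows are from comment:54 and comment:55):

# With timeout:5 and attempts:3, evdns (re)sends a query at 0 s, 5 s and
# 10 s. Two attempts fit the client's 10 s first-try window; all three
# fit the 15 s window used on client retries.
timeout_s, attempts = 5, 3
send_times = [i * timeout_s for i in range(attempts)]
print("query (re)sent at", send_times, "seconds")  # [0, 5, 10]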

I have one question, since Dhalgren was concerned about credit. Dhalgren -- do you have a preferred name and email address for git for the patch (or can you attach a git format-patch with your settings?)

comment:92 Changed 9 months ago by Dhalgren

1) yes

2) yes

3) yes, particularly because with multiple resolvers all requests pending on one marked down are requeued to the other resolvers--possibly hundreds of requests; in the single-resolver case no action is taken and the down-state is advisory: it clears on arrival of a successful answer and eventdns carries on

please use Dhalgren <dhalgren.tor@…> [dhalgren dot tor at gmail dot com]

thank you!

comment:93 Changed 8 months ago by mikeperry

Status: needs_informationmerge_ready

Ok, the patch has been applied in mikeperry/bug21394_033. Branch is off of maint-0.3.3. Author string has been set as requested.

comment:94 Changed 8 months ago by nickm

Milestone: Tor: 0.3.3.x-finalTor: 0.3.2.x-final

Thanks, everybody!

I've rebased mike's branch onto maint-0.2.9, in case we decide this is a good idea to backport. The new branch is called bug21394_029_redux.

I've merged that branch into 0.3.3 and forward.

comment:95 Changed 7 months ago by teor

Keywords: 032-backport added

comment:96 Changed 5 months ago by teor

Keywords: 031-unreached-backport added; 031-backport removed

0.3.1 is end of life, there are no more backports.
Tagging with 031-unreached-backport instead.

comment:97 Changed 4 weeks ago by teor

Keywords: 032-unreached-backport added; 032-backport removed

0.3.2 is end of life, so 032-backport is now 032-unreached-backport.

comment:98 Changed 4 weeks ago by teor

Milestone: Tor: 0.3.2.x-finalTor: 0.2.9.x-final

These tickets can't be backported to 0.3.2, because it is end of life.
But they can still be backported to 0.2.9.
