Opened 15 months ago

Last modified 2 days ago

#18580 new defect

exit relay fails with 'unbound' DNS resolver when lots of requests time-out

Reported by: Dhalgren Owned by:
Priority: Medium Milestone: Tor: unspecified
Component: Core Tor/Tor Version: Tor: 0.2.7.6
Severity: Major Keywords: tor-relay exit dns unbound needs-diagnosis
Cc: Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

per

[tor-relays] What does this message mean in my tor logs?
https://lists.torproject.org/pipermail/tor-relays/2016-January/008621.html

[tor-relays] unbound bogs down strangely, degrading exit relay
https://lists.torproject.org/pipermail/tor-relays/2016-March/008918.html

The relay daemon ceases to service Tor Browser requests (they time out) when a local instance of 'unbound' is the DNS resolver and large numbers of DNS requests time out.

Works fine when 'named' is swapped in place of 'unbound'.

GoDaddy DNS stops responding when large numbers of queries are submitted, and this was observed as the particular trigger.

To reproduce, configure the SOA+NS records for several thousand dummy domains to point at a non-responding IP, then generate large numbers of requests against them.
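A load generator along those lines can be sketched in shell. Everything here is an assumption for illustration, not part of the original report: domains.txt is a hypothetical file listing the dummy domains, and the resolver address and timing values are placeholders.

```shell
#!/bin/sh
# Sketch: fire one non-retrying query per dummy domain at the local
# resolver, in parallel, to build up a large set of requests that
# will never receive an answer.
gen_queries() {
    while read -r domain; do
        # +tries=1/+time=5 keep dig itself from retrying; the
        # resolver's own retry behavior is what is under test.
        dig @127.0.0.1 +tries=1 +time=5 "$domain" >/dev/null 2>&1 &
    done
    wait
}

if [ -r domains.txt ]; then
    gen_queries < domains.txt
fi
```

Backgrounding each dig keeps thousands of queries outstanding at once, which is what drives the resolver's pending-request list up.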

The commands

unbound-control dump_requestlist
unbound-control dump_infra

are helpful for identifying the state.

I have a debug-level daemon trace taken while the relay was in the unresponsive condition described.
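The resolver state at the moment of failure can also be captured for later analysis with something like the following sketch (it assumes unbound-control is configured for the local instance and that /tmp is writable):

```shell
#!/bin/sh
# Sketch: save the pending-request list and infrastructure cache to
# timestamped files so the failure state can be examined afterwards.
snapshot() {
    dir=${1:-/tmp}
    stamp=$(date +%Y%m%d-%H%M%S)
    unbound-control dump_requestlist > "$dir/requestlist.$stamp"
    unbound-control dump_infra > "$dir/infra.$stamp"
    echo "$dir/requestlist.$stamp $dir/infra.$stamp"
}

if command -v unbound-control >/dev/null 2>&1; then
    snapshot /tmp
fi
```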

Child Tickets

Change History (24)

comment:1 Changed 15 months ago by s7r

I understand that what triggers this is GoDaddy's policy of blocking IP addresses that make high volumes of DNS requests. This will happen regardless of which application is making those DNS requests (unbound, named, etc.), so why exactly is this unbound's fault?

If it's something in unbound and only in unbound, we should patch it upstream since it's not Tor related. If it's something in the Tor daemon that interacts with unbound in such a broken way, we should patch it, but can you please explain more?

comment:2 Changed 15 months ago by s7r

  • Severity changed from Major to Normal
  • Status changed from new to needs_information

comment:3 Changed 15 months ago by Dhalgren

  • Severity changed from Normal to Major
  • Status changed from needs_information to new

This affects tor running against Unbound. Even after putting named up, the GoDaddy block remains in effect. An exit ranked about 60 works perfectly with named and becomes absolutely unusable with Unbound in place. To my eyes that is a major-severity bug (though perhaps not high priority). It seems fairly obvious the problem is the tor daemon's handling of some quirk of timed-out request failures done the Unbound way. My guess is that nothing is wrong with Unbound.

I am the second operator to experience this problem and fix it with named.

I have a debug-level trace of about 3-5 minutes of the relay-unresponsive state, with known test queries being made via TBB. It is too big to put here; I will upload it to another service if someone asks.

comment:4 Changed 15 months ago by Dhalgren

One obvious difference is that named times out requests in ten seconds and does not fall back to TCP (with DNSSEC disabled), at least when querying via dig.

Unbound tries for 120 seconds, falling back to TCP. I researched Unbound's time-outs and did not find a way to modify the behavior.

https://unbound.net/documentation/info_timeout.html

comment:5 follow-up: Changed 15 months ago by arma

So to summarize, it sounds like unbound's behavior when doing a dns resolve is more aggressive than named's behavior? And Godaddy has some sort of abuse detection mechanism that makes it refuse to answer dns questions from loud IP addresses? And whatever unbound is doing is more often triggering godaddy's mechanism?

And while some people on tor-relays thought that this was maybe a Tor bug, it *can't* be a Tor bug if the issue is "the dns server you're asking questions to won't answer"? Or is there still a Tor bug here too, where Tor should handle it better when it doesn't get any dns answer?

comment:6 in reply to: ↑ 5 Changed 15 months ago by Dhalgren

Replying to arma:

So to summarize, it sounds like unbound's behavior when doing a dns resolve is more aggressive than named's behavior?

It appears that Unbound is more persistent than named, but it employs a sophisticated exponential back-off scheme, so I'm not sure it would be considered more aggressive. The documentation link above goes into the Unbound time-out scheme at great length. Named appears to have a much simpler and shorter retry/timeout approach.

And Godaddy has some sort of abuse detection mechanism that makes it refuse to answer dns questions from loud IP addresses?

In 2011 GoDaddy implemented a policy of blocking high-volume DNS requesters in order to avoid adding resources to their DNS server pool. At one point this apparently included blocking GoogleBot. It appears to be a manually maintained list with an arbitrary selection policy. See

http://rscott.org/dns/GoDaddy_Selective_DNS_Blackouts.htm

It appears that my Dhalgren relay was added to their block list three days ago and the 'ashtrayhat3' relay was added back in January. My relay continues to have DNS blocked by GoDaddy. Probably several other fast relays are blocked, but they never ran with Unbound and so it was not noticed.

And whatever unbound is doing is more often triggering godaddy's mechanism?

I doubt it's unbound (vs named) that caused GoDaddy to block DNS from my exit. They block high-volume DNS requesters in general. I also noticed that ed.gov is blocking my relay.

And while some people on tor-relays thought that this was maybe a Tor bug, it *can't* be a Tor bug if the issue is "the dns server you're asking questions to won't answer"? Or is there still a Tor bug here too, where Tor should handle it better when it doesn't get any dns answer?

I'm 80-90% sure it's a bug in the way the Tor daemon interacts with Unbound's behavior with respect to large numbers of timing-out DNS queries. Unbound appears to be perfectly fine with the situation when it occurs. Tor daemon DNS queries lock up wholesale, thus preventing normal exit browsing behavior. The Tor daemon is fine with the GoDaddy DNS block when named is the intermediary--large numbers of request time-outs for GoDaddy domains continue unabated.

Data-transfer via circuits appears unaffected as the relay earned a 100% rating increase from the BWauths while it was in the broken state (running 20% of normal traffic load) for 37 hours.

comment:7 Changed 15 months ago by Dhalgren

Here's a small random selection of GoDaddy domains that were timing out during the incident. They are useful for checking whether a GoDaddy DNS block is in effect for a particular resolver. They all have NS records ending in .domaincontrol.com.

akdinfotech.com
omniboxes.com
prontogain.in
bustotal.com
robbinsvillemontessori.com
honeyward.com
midivorciofacil.es
homeswestside.com
pacificnationscuplivestream.com
eshopplace.com
bredagenetics.com
hootersofdelaware.com
holographicbroadcasting.com
joli-bebe.us
isohunters.net
hostwallmedia.com
skysul.com.br
tabletribe.net
torrent4me.com
thedentedhelmet.com
automastersautosales.com
lawncareroswellgeorgia.com
hudsonprecision.com
westechconnect.com
punkoutlawrecords.com
liveinberlin.co
digitalmedia2design.com
ventechlab.com
idahopublichealth.com
hugohigamd.com
hhistoneworks.com
ilitaly.com
hoffmanforjudge.com
torrents.fm
taneshaawasthi.com
itexperiment.com
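To spot-check whether a given resolver is affected, one can probe a few of the domains above through it and look for missing answers. A sketch (the resolver address, timeout, and the idea of using +short output as a success signal are my assumptions):

```shell
#!/bin/sh
# Sketch: report which of the given domains fail to resolve through
# the given resolver within 10 seconds.
probe() {
    resolver=$1; shift
    for d in "$@"; do
        # +short prints only answer records; empty output means no answer
        if dig @"$resolver" +tries=1 +time=10 +short "$d" | grep -q .; then
            echo "$d: OK"
        else
            echo "$d: no answer (possible block)"
        fi
    done
}

# Example (uncomment to run against the local resolver):
# probe 127.0.0.1 akdinfotech.com omniboxes.com bustotal.com
```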

comment:8 Changed 15 months ago by Dhalgren

This is academic, but it looks like GoDaddy is blocking all traffic to their AS26496 network from my relay. I tested direct connections to the raw IPs of some GoDaddy-hosted websites and also see nothing but timeouts. I was considering trying to configure named to resolve domains with domaincontrol.com nameservers via Google DNS, but clearly there's no point.

comment:9 Changed 15 months ago by Dhalgren

I spent some time accessing relays with control-channel commands. GoDaddy does not appear to be systematically blocking fast Tor relays; however, the block stays once applied. 'ashtrayhat3' can resolve GoDaddy DNS queries because the operator pointed DNS at a regional resolver, but 'ashtrayhat3' still cannot connect to addresses in AS26496.

Judging by the 30000 GoDaddy domain lookup attempts found in the Unbound cache on my relay, I suspect that 'ashtrayhat3' and 'Dhalgren' were selected by some abuser as preferred nodes for WordPress login attacks against large numbers of small blog sites hosted by GoDaddy. GoDaddy decided to null-route the relays because of the volume of abuse.

Tor relays should be able to withstand such blocking, and at present they fall apart when Unbound is the resolver.

If a fix for this issue is created, I will probably be able to test it for some months going forward due to the persistence of the null-routing.

comment:10 Changed 15 months ago by Dhalgren

A debug-level SafeLogging=1 trace was taken while the relay was in the DNS-frozen state. During the trace a failed attempt was made to access the news site known for having the most efficient page design; it took one minute fifteen seconds. I also have a SafeLogging=0 trace if it turns out that would make a difference in analyzing the issue.

https://mega.nz/#!hFtXVarA!YRWOfHktyNpEKEQiovXj-3M3Xoz-1jLvk-3YykkT5do

comment:11 Changed 15 months ago by Dhalgren

After some thought I realized that the dramatically different timeout behavior of Unbound relative to named might be an issue in other use cases. I posted to the unbound-users list about it, and they responded with an insightful analysis. It seems possible that an Unbound compatibility feature/option might be created for applications that, like Tor, employ the eventdns component of libevent.

https://unbound.net/pipermail/unbound-users/2016-April/004297.html

comment:12 follow-up: Changed 15 months ago by Dhalgren

On the unbound-users thread Aaron Hopkins was tremendously helpful and led me to what I think is the correct analysis and resolution of this issue:

https://unbound.net/pipermail/unbound-users/2016-April/004301.html

TL;DR version

Unbound /etc/resolv.conf

options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100

named /etc/resolv.conf

options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100

Possibly, even probably, max-timeouts:1000000 is better, but I'd like to hear from a Tor developer on whether completely inhibiting the "down resolver" state in eventdns (even when the resolver is in fact down) is a good idea or not.

comment:13 in reply to: ↑ 12 ; follow-up: Changed 15 months ago by cypherpunks

Replying to Dhalgren:

Unbound /etc/resolv.conf

options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100

named /etc/resolv.conf

options timeout:5 attempts:1 max-inflight:4096 max-timeouts:100

Uhh, those 2 are the same. Just saying...

comment:14 Changed 15 months ago by Dhalgren

I put the exit back on Unbound to see how this works.

Settled on

options timeout:5 attempts:1 max-inflight:16384 max-timeouts:1000000
nameserver 127.0.0.1

where it turns out that max-timeouts is capped at 255 by eventdns.c. I will create a patch to remove the 255 limit on the next Tor daemon update. The only purpose of the "down resolver" state is to shift load to a different resolver, but in this situation that's undesirable: I have exactly one local resolver, and if it fails an alarm goes off for manual attention.
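For reference, my understanding of how these resolv.conf options are interpreted by the eventdns component (a sketch of the semantics; eventdns.c is authoritative):

```
# /etc/resolv.conf as consumed by the tor daemon's eventdns
#   timeout:5             seconds to wait before a query attempt times out
#   attempts:1            number of attempts made per query
#   max-inflight:16384    cap on simultaneously outstanding queries
#   max-timeouts:1000000  consecutive timeouts before a nameserver is
#                         marked "down" (currently clamped to 255)
options timeout:5 attempts:1 max-inflight:16384 max-timeouts:1000000
nameserver 127.0.0.1
```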

I might also create an alarm that triggers when

unbound-control dump_requestlist

grows to more than 200 pending requests, since that's what was observed during the relay failure. It shouldn't fail now, but it will be interesting to verify that and to examine the next potential DNS DoS situation.
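Such an alarm might be sketched as follows. The threshold and the parsing of dump_requestlist output are assumptions on my part; the exact output format may vary between Unbound versions.

```shell
#!/bin/sh
# Sketch: alert when the pending-request list grows past a threshold.
THRESHOLD=200

# Count entry lines, skipping the per-thread "thread #N" header and
# the '#'-prefixed column-header line. '|| true' keeps grep's exit
# status from aborting the script when the count is zero.
count_pending() {
    grep -vc -e '^thread' -e '^#' || true
}

if command -v unbound-control >/dev/null 2>&1; then
    pending=$(unbound-control dump_requestlist | count_pending)
    if [ "$pending" -gt "$THRESHOLD" ]; then
        echo "ALERT: $pending pending DNS requests (threshold $THRESHOLD)"
    fi
fi
```

In practice this would run from cron or a monitoring agent, with the echo replaced by whatever notification mechanism is in use.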

comment:15 in reply to: ↑ 13 Changed 15 months ago by Dhalgren

Oops, bad paste. It should be:

named /etc/resolv.conf

options timeout:5 attempts:2 max-inflight:4096 max-timeouts:1000000

where with 'named' a single retry is made by eventdns after five seconds. 'named' itself does not retry requests, whereas Unbound has an elaborate retry scheme.

comment:16 Changed 15 months ago by Dhalgren

With Unbound running I did positively confirm that

time dig @127.0.0.1 +tries=1 +time=20 <domain on non-responding NS server>

runs for 20 seconds and quits with no response from the resolver, whereas with 'named' the same command shows the resolver sending back a SERVFAIL at 10 seconds. In the above scenario, 'unbound-control dump_requestlist' shows the requested domain for the 120 seconds that Unbound attempts to resolve it.

comment:17 Changed 15 months ago by Dhalgren

A downside to max-inflight:16384 is potential performance degradation of the primary Tor-process event loop, due to linear queue searches for completed requests. The man(3) page for evdns states:

Several algorithms require a full walk of the inflight queue and so bounding its size keeps thing going nicely under huge (many thousands of requests) loads.

My perspective is that the in-flight queue would only grow large under potential DoS scenarios such as the one described in this ticket, and paying the performance cost of linear list searches is acceptable (on modern hardware) if it prevents exit relays from becoming unusable. An unlimited-size red-black-tree-indexed work queue would be ideal, but would necessitate significant development effort to implement.

A mitigating factor is that when the in-flight queue grows large it will predominantly consist of entries that will never receive a response and will time out in the order they appear on the queue. On the other hand, a majority of completed requests will appear at the very end of the list, suggesting a doubly linked structure and searching for completions from the end rather than the beginning when the in-flight count exceeds a threshold.

Last edited 15 months ago by Dhalgren (previous) (diff)

comment:18 Changed 13 months ago by nickm

  • Sponsor None deleted

Change Core Tor tickets with sponsor "None" to "", per our practice elsewhere. This will help sorting.

comment:19 Changed 11 months ago by nickm

  • Milestone set to Tor: 0.2.???

comment:20 Changed 8 months ago by Dhalgren

After coming to understand the issue and tuning for it, I created an alarm that triggers whenever the output from the 'unbound-control dump_requestlist' command exceeds 200 entries (400 actually, as two DNS worker threads are configured and the command shows only thread 0). This has been hit twice now, and today I managed to look at it while the event was still in progress.

This time instead of GoDaddy, the DNS enumeration target was a major Australian DNS provider and they, like GoDaddy, blocked DNS etc. traffic from the exit node.

The tuning appears to work, as the node continued to be usable with good performance when "setconf ExitNodes=<fp>" was applied for testing on a client. I observed a 15% or so drop-off in traffic per the ISP utilization graph at the time of the alarm, with full recovery occurring in about one hour. It is hard to say whether the drop-off was the result of a performance impact from the DNS abuse or whether it's because the attacker gave up using the node just after I started looking--dump_requestlist began shrinking rapidly then.

In any case the node did vastly better than at the time of the original attack, before the eventdns tuning was applied. At that time the relay was effectively taken down for two days. That attacker left their abuse program running and didn't notice that GoDaddy had put an end to the scan.

A look at the utilization graph for the earlier alarm incident showed no performance impact, but that one occurred during off-peak hours when utilization runs significantly lower.

comment:21 Changed 7 months ago by teor

  • Milestone changed from Tor: 0.2.??? to Tor: 0.3.???

Milestone renamed

comment:22 Changed 6 months ago by nickm

  • Keywords tor-03-unspecified-201612 added
  • Milestone changed from Tor: 0.3.??? to Tor: unspecified

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

comment:23 Changed 5 weeks ago by nickm

  • Keywords tor-03-unspecified-201612 removed

Remove an old triaging keyword.

comment:24 Changed 2 days ago by nickm

  • Keywords tor-relay exit dns unbound needs-diagnosis added