Opened 2 years ago

Closed 2 years ago

#24010 closed defect (wontfix)

Make bandwidth authorities use DNS, not IP addresses

Reported by: teor
Owned by: aagbsn
Priority: High
Milestone:
Component: Core Tor/Torflow
Version:
Severity: Normal
Keywords:
Cc: arthuredelstein
Actual Points:
Parent ID: #21394
Points: 1
Reviewer:
Sponsor:

Description

Currently, bandwidth authorities don't see DNS timeouts, because they use IP addresses to connect to bandwidth servers.

We should encourage bandwidth authority operators to use DNS instead, so we down-rate exits with poor DNS connectivity.

Change History (11)

comment:1 Changed 2 years ago by teor

Found during the analysis of #21394.

comment:2 Changed 2 years ago by arthuredelstein

Cc: arthuredelstein added

comment:3 Changed 2 years ago by teor

I sent an email to the directory authority list about this.

comment:4 Changed 2 years ago by teor

Parent ID: #21394

comment:5 Changed 2 years ago by micah

This strikes me as adding a potentially fragile layer to an already teetering edifice.

The domain name that I would have used for this was recently 'blocked' for a third of all home cable users in Chile, because the ISP mistakenly attributed WannaCry to our exit node, or maybe even our directory authority. It took months to track this down, along with in-person meetings with the ISP.

DNS is also frequently the easiest thing for overzealous countries to block. If we depend on it for bandwidth scanning, I feel like we are adding a layer to the system that enables the entire stack to easily fall down when pushed at the top. Don't like tor running in your country? Just block these domain names from resolving and it will cause all relays in your country to get penalized by tor's bandwidth scanners so much that they are useless. If we were to do this, then I would say that bandwidth scanner web servers should be reached over different domain names, so that if one is blocked, the other is not also impacted.

DNS can also be a bit funny: there is caching, and sometimes certain information cannot be looked up while other information resolves without problems. Having a server look up one hostname on every pass of the bandwidth scanner is likely just going to test that DNS can resolve properly once, and then cache that result for an unpredictable amount of time (depending on the DNS SOA record for the domain in question, the local resolver settings, and the leaf resolvers upstream).
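To illustrate the caching concern, here is a minimal sketch of a toy caching resolver. It is a simulation only, not a real DNS client; the class, the function, the hostname `bwserver.example`, and the TTL and interval values are all hypothetical. It counts how many scanner passes would actually exercise DNS when the same hostname is looked up every pass.

```python
class CachingResolver:
    """Toy stand-in for a caching resolver; not a real DNS client."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.cache = {}            # hostname -> cache expiry time
        self.upstream_queries = 0  # lookups that actually exercised DNS

    def resolve(self, hostname, now):
        expiry = self.cache.get(hostname)
        if expiry is not None and now < expiry:
            return "cached"        # answered locally, DNS path not tested
        self.upstream_queries += 1
        self.cache[hostname] = now + self.ttl_seconds
        return "upstream"


def count_real_lookups(ttl_seconds, pass_interval, passes):
    """Count lookups that reach upstream when one hostname is reused."""
    resolver = CachingResolver(ttl_seconds)
    for i in range(passes):
        resolver.resolve("bwserver.example", now=i * pass_interval)
    return resolver.upstream_queries
```

Under these assumptions, twelve passes ten minutes apart with a one-hour TTL, `count_real_lookups(3600, 600, 12)`, reach upstream only twice; the other ten passes never test the resolver.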

I get the point of doing this, but I am not convinced that this should be the role of bandwidth scanners. Bandwidth scanners should be simply testing the speed of the network, and nothing else. It's already overly complicated, even for that one task. I think DNS reachability tests are important, and the problem does need to be fixed, but I wonder if this should be done some other way. Perhaps in the client itself? I am unsure.

comment:6 in reply to:  5 ; Changed 2 years ago by teor

Replying to micah:

This strikes me as adding a potentially fragile layer to an already teetering edifice.


I get the point of doing this, but I am not convinced that this should be the role of bandwidth scanners. Bandwidth scanners should be simply testing the speed of the network, and nothing else. It's already overly complicated, even for that one task. I think DNS reachability tests are important, and the problem does need to be fixed, but I wonder if this should be done some other way. Perhaps in the client itself? I am unsure.

If the role of bandwidth scanners is to measure bandwidth *as clients experience it*, then using at least some DNS is appropriate.
We could use a mix of DNS and IP, because that's what clients do. And if we use a CDN as the server, it will need DNS.
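A minimal sketch of what a DNS/IP mix could look like, assuming a scanner that picks its target once per measurement. The function name, the `dns_fraction` parameter, and the example hostname and address are hypothetical and are not part of Torflow.

```python
import random


def pick_target(hostname, ip_address, dns_fraction, rng):
    """Return the address to fetch from for one measurement.

    dns_fraction is the (hypothetical) share of measurements that should
    go through a hostname, so that the exit has to resolve it.
    """
    if rng.random() < dns_fraction:
        return hostname
    return ip_address


rng = random.Random(0)  # seeded only to make the sketch repeatable
targets = [pick_target("bwserver.example", "192.0.2.1", 0.5, rng)
           for _ in range(10)]
```

With `dns_fraction` set to 0.0 or 1.0 this reduces to the two options already discussed in the ticket: IP only, or DNS only.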

Maybe clients should give up on timed-out circuits faster; I opened #24022 for this.

Also, exits can check their own DNS (#24014), but judging what is a slow resolve is hard, because it needs a comparison to other exits.

comment:7 in reply to:  6 Changed 2 years ago by arthuredelstein

Replying to teor:

If the role of bandwidth scanners is to measure bandwidth *as clients experience it*, then using at least some DNS is appropriate.
We could use a mix of DNS and IP, because that's what clients do. And if we use a CDN as the server, it will need DNS.

I tend to agree with micah that we shouldn't conflate measuring bandwidth with DNS resolver failure rate. These are two different measurements, and have different observable effects in clients. In Tor Browser, we see frequent DNS resolver failures, which cause very long delays in first connecting to a website (ten or twenty seconds).

But I do think it might be a good approach for bandwidth authorities to provide a second, separate service of measuring resolver failure rate. I agree it might require using a large pool of domain names to avoid being vulnerable to an attack by an ISP or host country.

Also, exits can check their own DNS (#24014), but judging what is a slow resolve is hard, because it needs a comparison to other exits.

I don't think you need to compare with other exits. We know that tor has a hard-coded 10-second timeout. If the DNS resolver takes longer than 10 seconds, then that should be counted as a failure. Obviously, whether it's self-reporting by the exit or measurement by a bandwidth authority, you'd want to pick a threshold failure rate above which exits are penalized or their exit status is disabled.
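The rule described here can be sketched as follows. The 10-second value is the timeout named in the comment; the threshold of 0.05 is purely hypothetical, since no threshold is proposed in this ticket, and the function names are illustrative only.

```python
DNS_TIMEOUT_SECONDS = 10.0      # timeout named in the comment above
FAILURE_RATE_THRESHOLD = 0.05   # hypothetical; the ticket gives no value


def failure_rate(resolve_times):
    """Fraction of lookups that count as failures.

    resolve_times holds the seconds each lookup took, or None when no
    answer arrived at all. Anything over the timeout is a failure.
    """
    if not resolve_times:
        return 0.0
    failures = sum(1 for t in resolve_times
                   if t is None or t > DNS_TIMEOUT_SECONDS)
    return failures / len(resolve_times)


def should_penalize(resolve_times):
    """True when the failure rate exceeds the assumed threshold."""
    return failure_rate(resolve_times) > FAILURE_RATE_THRESHOLD
```

The same function would apply whether the timings come from the exit's self-report or from an external measurement; only the source of `resolve_times` changes.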

comment:8 Changed 2 years ago by Sebastian

I used DNS to set up my bwauth but it doesn't measure the slice with the fast relays in a reasonable time. Excerpt from htop:

 1632 bwscanner  39  19  445M  181M  9044 S  1.3  2.3  5:43.95 python bwauthority_child.py ./data/scanner.1/bwauthority.cfg 1
14584 bwscanner  39  19  402M  147M  8852 S  0.0  1.9  0:00.00 python bwauthority_child.py ./data/scanner.2/bwauthority.cfg 7
14583 bwscanner  39  19  395M  144M  8900 S  0.0  1.9  0:00.01 python bwauthority_child.py ./data/scanner.3/bwauthority.cfg 10
14587 bwscanner  39  19  391M  142M  8836 S  0.0  1.9  0:00.00 python bwauthority_child.py ./data/scanner.4/bwauthority.cfg 7
12572 bwscanner  39  19  393M  146M  9024 S  3.3  1.9  2:42.51 python bwauthority_child.py ./data/scanner.5/bwauthority.cfg 12
14555 bwscanner  39  19  404M  184M  8896 S  0.0  2.4  0:00.03 python bwauthority_child.py ./data/scanner.6/bwauthority.cfg 10
12742 bwscanner  39  19  463M  143M  8980 S  0.0  1.9  0:02.83 python bwauthority_child.py ./data/scanner.7/bwauthority.cfg 11
10627 bwscanner  39  19  391M  143M  8852 S  0.0  1.9  0:03.80 python bwauthority_child.py ./data/scanner.8/bwauthority.cfg 10
11385 bwscanner  39  19  462M  142M  8900 S  0.0  1.8  0:03.37 python bwauthority_child.py ./data/scanner.9/bwauthority.cfg 1

AFAIK, the number at the end of the line can be used to indicate progress in running through a slice after first startup. The above is paired with this:

WARN[Wed Nov 01 08:45:09 2017]:Only measured 48.000000 of the previous consensus bandwidth despite measuring 65.100000 of the nodes

To me that sounds like those really fast nodes that are failing DNS now just simply slow down the measurement process instead of my bwauth deciding that they are unsuitable.
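A toy calculation shows how the share of measured bandwidth can fall well below the share of measured nodes when it is the fast relays that go unmeasured. The relay names and bandwidth values are made up, and this is not how Torflow computes the figures in the warning above.

```python
def measured_fractions(relays, measured):
    """Return (fraction of nodes measured, fraction of bandwidth measured).

    relays maps relay name -> consensus bandwidth; measured is the set of
    relay names that the scanner managed to measure.
    """
    total_bw = sum(relays.values())
    measured_bw = sum(bw for name, bw in relays.items() if name in measured)
    measured_nodes = sum(1 for name in relays if name in measured)
    return measured_nodes / len(relays), measured_bw / total_bw


# Hypothetical network: two fast relays that were never measured,
# and four slow relays that were.
relays = {"fast1": 400, "fast2": 400,
          "slow1": 50, "slow2": 50, "slow3": 50, "slow4": 50}
measured = {"slow1", "slow2", "slow3", "slow4"}
node_frac, bw_frac = measured_fractions(relays, measured)
```

In this made-up case two thirds of the nodes are measured but only a fifth of the bandwidth, the same direction of gap as in the warning.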

Last edited 2 years ago by Sebastian

comment:9 Changed 2 years ago by Sebastian

This doesn't seem to actually do anything other than drastically slow down measurement of fast relays.

comment:11 Changed 2 years ago by teor

Resolution: wontfix
Status: new → closed

Excellent, let's not change anything then: operators can use DNS or IP, and it won't make a difference.