Opened 10 months ago

Closed 7 months ago

#32864 closed enhancement (fixed)

Reproduce Arthur's exit failures and then contact or badexit the relays

Reported by: arma
Owned by: gk
Priority: Medium
Milestone:
Component: Community/Relays
Version:
Severity: Normal
Keywords: network-health, network-health-roadmap-2020Q1, GeorgKoppen202003
Cc: nusenu, ggus, arthuredelstein
Actual Points: 3.5
Parent ID:
Points: 2
Reviewer: dgoulet
Sponsor:

Description

https://arthuredelstein.net/exits/
lists a pile of exit relays, including some very fast exit relays, that are failing all of their DNS queries. That is, they claim to be exits, and although Tor clients can probably rarely use them successfully, clients still *try* to use them, contributing to the long tail of low-probability high-impact misery of being a Tor client.

We should verify that we agree with his scripts, and also make sure we are comfortable running the checks on our own.

Then we should contact the affected relays and either get them to fix their DNS or figure out what the bug is. Failing all of that, we should set the badexit flag for them, to save clients the trouble of trying them and failing.

Then once we've done a round of that, we should come up with a process by which we repeat it regularly.

Attachments (3)

Change History (21)

comment:1 Changed 10 months ago by gk

Keywords: GeorgKoppen202001 added

comment:2 Changed 9 months ago by gk

I just stumbled over #24014 and #26691 for more context and a potentially bigger plan.

comment:3 Changed 9 months ago by gk

Cc: arthuredelstein added

So, I started looking into this, but I have not had a single successful run so far (I tried twice). After a while, during the third round of the exit relay loop, the script throws an exception and aborts:

main function encountered error
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
    _inlineCallbacks(r, g, status)
--- <exception caught here> ---
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 125, in _main
    exit_results = await test_relays(reactor, state, socks, [guard1], exits, 10, bareIP)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 105, in test_relays
    result = await time_two_hop(reactor, state, socks, relay, exit_node, bareIP)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 76, in time_two_hop
    circuit_results = await build_two_hop_circuit(state, guard, exit_node)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 54, in build_two_hop_circuit
    return { "circuit" : circuit,
builtins.UnboundLocalError: local variable 'circuit' referenced before assignment

I wonder how Arthur is running that and whether he encountered similar bugs. This is with Tor 0.3.5.8, Python 3.7.3, python3-txtorcon 18.3.0-1 on a Debian 10 system.

comment:4 Changed 9 months ago by arma

See #29343 for an older version of this ticket.

comment:5 Changed 9 months ago by gk

Progress. I completed the 10 rounds of exit scanning by patching relay_perf.py:

async def build_two_hop_circuit(state, guard, exit_node):
+    circuit = {}
     success = None
     error = ""
     t_start = time.time()
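
To illustrate why that one-line change helps, here is a minimal sketch. It is not the actual relay_perf.py code: it assumes the function assigns circuit inside a try block and returns it afterwards, and failing_build and build_two_hop_circuit_sketch are made-up names.

```python
import asyncio


async def failing_build():
    # Hypothetical stand-in for a circuit build that raises before returning.
    raise RuntimeError("circuit build failed")


async def build_two_hop_circuit_sketch(build):
    circuit = {}  # initialised up front, as in the patch
    success = None
    error = ""
    try:
        circuit = await build()
        success = True
    except Exception as exc:
        # Without the initialisation above, `circuit` would still be unbound
        # here and the return statement would raise UnboundLocalError.
        success = False
        error = str(exc)
    return {"circuit": circuit, "success": success, "error": error}


result = asyncio.run(build_two_hop_circuit_sketch(failing_build))
print(result)
```

With the initialisation in place, a failed build is reported as a normal result instead of aborting the whole scan.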

Now I get a TLS error when connecting to Onionoo:

  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 127, in _main
    exit_results["_relays"] = relay_data(True)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 28, in relay_data
    response = urllib.request.urlopen(req).read()
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>

comment:6 Changed 9 months ago by gk

The problem seems to be gone on my box, but I am not sure exactly what the issue was (except that it was a local one). I don't see a similar error on a different, freshly set up Debian Buster machine. Hrm.

comment:7 Changed 9 months ago by gk

I fixed a bunch of issues (patches attached) and have this running now. As a next step I need to think about a good analysis of the results (while starting the contact/badexit process in parallel).

comment:8 Changed 9 months ago by arthuredelstein

Awesome that you are looking into this Georg! I have the script running daily to generate the results on the website and I haven't run into the errors you saw. But your first patch makes sense to me and I applied it to master.

The other two patches I won't apply because I don't want to break the live site, but I'm happy to try to help with any problems you might be running into.

comment:9 in reply to:  8 Changed 9 months ago by gk

Replying to arthuredelstein:

> Awesome that you are looking into this Georg! I have the script running daily to generate the results on the website and I haven't run into the errors you saw. But your first patch makes sense to me and I applied it to master.
>
> The other two patches I won't apply because I don't want to break the live site, but I'm happy to try to help with any problems you might be running into.

Thanks, you are welcome. The exit relay you specified is down, no? See: https://metrics.torproject.org/rs.html#details/7BD7B547676257EF147F5D5B7A5B15F840F4B579, so you need to pick another one, which my third patch does. Ideally, we would not hard-code a relay here, as this breaks from time to time. (And it broke for me, hence the patch.) I guess a better solution would be to pick a proper exit relay from the relays you have been testing anyway, before testing the non-exit ones. But for now I don't see why you can't take my third patch: how would it break the live site?
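
A rough sketch of that "pick from the tested exits" idea, assuming the exit results can be reduced to a mapping from fingerprint to a list of per-attempt outcomes (True means success). That shape, the placeholder fingerprints and the function name pick_working_exit are assumptions for illustration, not what relay_perf.py does today.

```python
import random


def pick_working_exit(exit_results, min_success_ratio=1.0, rng=random):
    """Return the fingerprint of an exit that passed enough of its attempts."""
    candidates = sorted(
        fp
        for fp, outcomes in exit_results.items()
        if outcomes and sum(outcomes) / len(outcomes) >= min_success_ratio
    )
    if not candidates:
        raise LookupError("no tested exit met the success ratio")
    # Choose at random so that no single relay is effectively hard-coded.
    return rng.choice(candidates)


# Hypothetical results from an earlier exit scan (10 attempts per relay).
exit_results = {
    "EXIT_A": [False] * 10,
    "EXIT_B": [True] * 10,
    "EXIT_C": [True] * 7 + [False] * 3,
}

chosen = pick_working_exit(exit_results)
print(chosen)
```

Raising when no candidate qualifies keeps the non-exit scan from silently running through a broken exit.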

For the second one, yeah, I can see it. If you like I can try to rewrite it in a way that better fits your needs.

If you have some scripts to group the results given some parameters (like "all relays with a DNS error in 80% of the cases during the last n days"), I'd be happy to hear about them. It would probably be smart to have some automated way of at least extracting all the info for bad-exit decisions.
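
Something along these lines is what I have in mind for the grouping. It is only a sketch: the input shape (one dict per day, mapping fingerprint to a list of per-attempt outcomes) and the name relays_failing_often are assumptions, not the format the survey tool actually writes.

```python
from datetime import date, timedelta


def relays_failing_often(daily_results, today, days=7, min_failure_ratio=0.8):
    """Fingerprints whose failure ratio over the last `days` days is high enough."""
    cutoff = today - timedelta(days=days)
    totals = {}  # fingerprint -> [failed attempts, total attempts]
    for day, results in daily_results.items():
        if not (cutoff < day <= today):
            continue  # outside the window of interest
        for fp, outcomes in results.items():
            counts = totals.setdefault(fp, [0, 0])
            counts[0] += sum(1 for ok in outcomes if not ok)
            counts[1] += len(outcomes)
    return sorted(
        fp
        for fp, (failed, attempts) in totals.items()
        if attempts and failed / attempts >= min_failure_ratio
    )


# Hypothetical scan results; True means the DNS lookup succeeded.
daily_results = {
    date(2020, 2, 1): {"RELAY_B": [False] * 10},  # too old, ignored
    date(2020, 3, 9): {"RELAY_A": [False] * 10, "RELAY_B": [True] * 10},
    date(2020, 3, 10): {
        "RELAY_A": [False] * 8 + [True] * 2,
        "RELAY_B": [True] * 9 + [False],
    },
}

flagged = relays_failing_often(daily_results, today=date(2020, 3, 10))
print(flagged)
```

Aggregating over several days, rather than judging a single run, should reduce the chance of flagging a relay for a transient resolver hiccup.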

comment:10 Changed 9 months ago by gk

For the scripting part, I played a bit with jq and will start using that for now. We should be clearer about the longer-term plan here before investing in a more robust solution, but I feel the script(s) I am thinking about writing could easily be reused even in that scenario.

In parallel, I will start reaching out to relay operators to get their setups fixed and/or the relays badexited.

comment:11 Changed 9 months ago by gk

Keywords: GeorgKoppen202002 added; GeorgKoppen202001 removed

Move my tickets to Feb 2020.

comment:12 Changed 9 months ago by gk

Keywords: network-health-roadmap-2020Q1 added
Points: 2

comment:13 Changed 8 months ago by gk

Keywords: GeorgKoppen202003 added; GeorgKoppen202002 removed

Moving my tickets.

comment:14 Changed 7 months ago by gk

FWIW, I wrote a script that gives me the fingerprints of relays that fail to connect to https://eff.org at least a given number of times (each relay is tried 10 times), and I contacted the affected relay operators where contact information was available (I started with relays failing 10/10 times, comparing the results of both Arthur's test run and my own). I'll start bad-exiting relays later this week and will post some statistics in this ticket as well.

I'll test the script further (and probably fine-tune it a bit more) this week, too. The plan is to have it as part of the helper-scripts repo later on.

comment:15 Changed 7 months ago by gk

Reviewer: dgoulet
Status: assigned → needs_review

Okay, a final note here: I created a script that pulls exit relays failing DNS queries at or above a certain threshold (by default only relays failing 10/10 times are shown) out of a JSON blob created either by Arthur's exit DNS check tool or by my own run. Last week I contacted the respective exit relay operators (that's the "[s]" below; "[sf]" means "mail sent but bounced") and did not really hear back (just one replied, saying they are looking into it). So, now is the time to actually start the badexiting process. I pushed a rule to mark all the exits below as badexit:

[s]                             "$296B2178FD742AB35AB20C9ADF04D5DFD3D407EB"
[s]                             "$3BADB3EFFB87534736BFAC9A2024AB78401BDBC3"
[no email address]              "$4684E03631097C77F013637EC800D499CD71C250"
[s]                             "$51AE5656C81CD417479253A6363A123A007A2233"
[no email address]              "$53FF368902D124FA9A806D149AF22C3A6357B150"
[s]                             "$5AD1D535373C05BB1624BD2A76DDE713E974240E"
[s]                             "$9AD12F0E3CC871D59ACA14BB4076CDD8CB28DE57"
[s]                             "$9C339F4F3101B744C8C040C9F51D63B520D38712"
[sf]                            "$9E9C2223EA179F52BA73A24BFDE2E44DCA468EEF"
[no email address]              "$A5B682E846615088362A3B2BD11C353C84778659"
[s]                             "$AD1639F47D6233E812A67F98F9D76FF55D1D2ECC"
[no email address]              "$F912C0A30DC9CBD4E7BA566C235DA194C4623EC0"
[s]                             "$FAD823A2AA7400D4A8107D7CD83050EEBB7A51FE"
[no email address]              "$FBBC3BD58B471F6227DC0F05265C6A37C770905F"
[s]                             "$FE59C12C9697E742CD3F7ADBAF6385EA1C8B379F"

(FWIW: As said previously, I slightly modified Arthur's script to use https://eff.org to check the exits, as comparing the results with Arthur's allows us to differentiate between DNSSEC-only issues and more general ones. That's useful when contacting relay ops, in particular until #33179 is solved.)

Marking this ticket as needs_review for the script I want to add to the helper-scripts repo.
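
For the record, the threshold filter and the comparison of the two runs boil down to roughly the following. This is a simplified sketch and not the script under review: the input shape (fingerprint mapped to a list of per-attempt outcomes), the placeholder fingerprints and the function name failing_fingerprints are assumptions.

```python
def failing_fingerprints(results, threshold=10):
    """Fingerprints that failed at least `threshold` of their attempts."""
    return {
        fp
        for fp, outcomes in results.items()
        if sum(1 for ok in outcomes if not ok) >= threshold
    }


# Hypothetical per-relay outcomes (True = success), 10 attempts per relay.
arthur_run = {
    "RELAY_A": [False] * 10,
    "RELAY_B": [False] * 10,
    "RELAY_C": [True] * 10,
}
own_run = {  # same scan, but checking https://eff.org
    "RELAY_A": [False] * 10,
    "RELAY_B": [True] * 10,
    "RELAY_C": [True] * 9 + [False],
}

failing_arthur = failing_fingerprints(arthur_run)
failing_own = failing_fingerprints(own_run)

# Failing in both runs: a general DNS problem, candidates for badexit.
general = sorted(failing_arthur & failing_own)
# Failing only in Arthur's run: possibly a DNSSEC-only issue.
only_arthur = sorted(failing_arthur - failing_own)

print(general, only_arthur)
```

Keeping the two sets apart makes it easier to tell an operator which kind of problem their relay seems to have.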

comment:16 Changed 7 months ago by gk

Actual Points: 3.5

comment:17 Changed 7 months ago by dgoulet

Status: needs_review → merge_ready

Good to go.

comment:18 in reply to:  17 Changed 7 months ago by gk

Resolution: fixed
Status: merge_ready → closed

Replying to dgoulet:

> Good to go.

Thanks. Merged to master (commit b40a0e100e8d3e6d1503076ddb6ddfbf05346a59). There are follow-up tickets we can work on, and then we can think about setting up some infrastructure to make badexiting easier. We are good for this ticket, though.
