Reproduce Arthur's exit failures and then contact or badexit the relays

added GeorgKoppen202003 actualpoints::3.5 component::community/relays network-health network-health-roadmap-2020Q1 owner::gk points::2 priority::medium resolution::fixed reviewer::dgoulet severity::normal status::closed type::enhancement labels

Trac:
Keywords: N/A deleted, GeorgKoppen202001 added

I just stumbled over #24014 (moved) and #26691 (moved) for more context and a potentially bigger plan.

So, I started looking into this but I haven't had a single successful run so far (I tried twice). After a while, during the third round of the exit relay loop, the script throws an exception and aborts:

main function encountered error
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
    _inlineCallbacks(r, g, status)
--- <exception caught here> ---
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 125, in _main
    exit_results = await test_relays(reactor, state, socks, [guard1], exits, 10, bareIP)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 105, in test_relays
    result = await time_two_hop(reactor, state, socks, relay, exit_node, bareIP)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 76, in time_two_hop
    circuit_results = await build_two_hop_circuit(state, guard, exit_node)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 54, in build_two_hop_circuit
    return { "circuit" : circuit,
builtins.UnboundLocalError: local variable 'circuit' referenced before assignment

I wonder how Arthur is running that and whether he encountered similar bugs. This is with Tor 0.3.5.8, Python 3.7.3, python3-txtorcon 18.3.0-1 on a Debian 10 system.

Trac:
Cc: nusenu, ggus to nusenu, ggus, arthuredelstein

See #29343 (moved) for an older version of this ticket.

Progress. I completed the 10 rounds of exit scanning by patching relay_perf.py:

async def build_two_hop_circuit(state, guard, exit_node):
+    circuit = {}
     success = None
     error = ""
     t_start = time.time()
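
For context, here is a minimal, self-contained sketch of the failure mode the patch addresses. It does not reproduce relay_perf.py; the function names and the simulated failure are made up for illustration. The point is only that a name bound on the success path alone is unbound when the return statement runs after a failure.

```python
def build_without_default(fail):
    # Hypothetical stand-in: 'circuit' is only bound if the build succeeds.
    try:
        if fail:
            raise RuntimeError("circuit build failed")
        circuit = {"id": 1}
    except RuntimeError:
        pass
    # Raises UnboundLocalError when fail is True.
    return {"circuit": circuit}


def build_with_default(fail):
    # Same idea as the patch: bind the name before anything can fail.
    circuit = {}
    try:
        if fail:
            raise RuntimeError("circuit build failed")
        circuit = {"id": 1}
    except RuntimeError:
        pass
    return {"circuit": circuit}
```

With the default in place, a failed build yields an empty circuit entry instead of aborting the whole scan.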

Now I get a TLS error when connecting to Onionoo:

  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 127, in _main
    exit_results["_relays"] = relay_data(True)
  File "/home/gk/exit-dns/tor_dns_survey/relay_perf.py", line 28, in relay_data
    response = urllib.request.urlopen(req).read()
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 1360, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.7/urllib/request.py", line 1319, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)>

The problem seems to be gone on my box but I am not sure exactly what the issue was (except it's been a local one). I don't see a similar error on a different Debian Buster machine, freshly set up. Hrm.
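
In case someone else hits this: "unable to get local issuer certificate" generally means the issuing CA could not be found in the local trust store. The sketch below is only a way to inspect where Python's ssl module looks for CA certificates by default; it is not a confirmed diagnosis of what went wrong on my box, and the helper name is made up.

```python
import os
import ssl


def describe_ca_paths():
    """Map each default CA location to (path, whether it exists on disk)."""
    paths = ssl.get_default_verify_paths()
    report = {}
    for name in ("cafile", "capath", "openssl_cafile", "openssl_capath"):
        location = getattr(paths, name)
        exists = bool(location) and os.path.exists(location)
        report[name] = (location, exists)
    return report


if __name__ == "__main__":
    for name, (location, exists) in describe_ca_paths().items():
        print(f"{name}: {location} (exists: {exists})")
```

If none of the locations exist, a missing or broken CA bundle package would be a plausible suspect.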

I fixed a bunch of issues (patches attached) and have this running now. Need to think about good analysis of the results in a next step (while starting he contact/badexit process in parallel).


Trac:
0001-Initialize-circuit.patch

Trac:
0002-Create-results-directories-if-they-do-not-exist.patch

Trac:
0003-Add-working-exit-FP-for-now.patch

Awesome that you are looking into this Georg! I have the script running daily to generate the results on the website and I haven't run into the errors you saw. But your first patch makes sense to me and I applied it to master.

The other two patches I won't apply because I don't want to break the live site, but I'm happy to try to help with any problems you might be running into.

Replying to arthuredelstein:

Thanks, you are welcome. The exit relay you specified is down, no? See: https://metrics.torproject.org/rs.html#details/7BD7B547676257EF147F5D5B7A5B15F840F4B579. So you need to pick another one, which my third patch does. Ideally, we would not hard-code a relay here, as this breaks from time to time (and it broke for me, hence the patch). I guess a better solution would be to pick a working exit relay from the relays you have been testing anyway, before testing the non-exit ones. But for now I don't see why you can't take my third patch: how would it break the live site?
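
A rough sketch of what picking an exit dynamically could look like, assuming the relay list is shaped like Onionoo details entries (with "fingerprint", "running" and "flags" fields). The function name is hypothetical and this is not taken from relay_perf.py.

```python
def pick_exit(relays):
    """Return the fingerprint of the first running, non-BadExit exit, or None."""
    for relay in relays:
        flags = relay.get("flags", [])
        if relay.get("running") and "Exit" in flags and "BadExit" not in flags:
            return relay["fingerprint"]
    return None
```

Ideally the choice would also take the results of the exit scan into account, so that a relay that just failed is not picked.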

For the second one, yeah, I can see it. If you like I can try to rewrite it in a way that better fits your needs.

If you have some scripts to group the results given some parameters (like "all relays with a DNS error in 80% of the cases during the last n days") I'd be happy to hear about them. It would probably be smart to have some automated way of at least extracting all the info for bad-exit decisions.

For the scripting part, I played a bit with jq and will start using that for now. We should be clearer about the longer-term plan here before investing in a more robust solution, but I feel the script(s) I am thinking about writing could easily be reused even in that scenario.

In parallel, I'll start reaching out to relay operators to get their setups fixed and/or the relays badexited.

Move my tickets to Feb 2020.

Trac:
Keywords: GeorgKoppen202001 deleted, GeorgKoppen202002 added

Trac:
Points: N/A to 2
Keywords: N/A deleted, network-health-roadmap-2020Q1 added

Moving my tickets.

Trac:
Keywords: GeorgKoppen202002 deleted, GeorgKoppen202003 added

FWIW, I wrote a script that gives me the fingerprints of relays that fail to connect to https://eff.org at least a given number of times (each relay is tried 10 times) and contacted the affected relay operators as far as contact information is available (I started with relays failing 10/10 times, comparing the results of Arthur's test run with my own). I'll start bad-exiting relays later this week and will post some statistics in this ticket as well.

I'll test the script I have further (and probably fine-tune it a bit more) during this week, too. The plan is to have it as part of the helper-scripts repo later on.
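
To give an idea of the threshold filter, here is a minimal sketch. It is not the actual script, and the JSON layout is an assumption: a mapping from relay fingerprint to a list of per-attempt results, each carrying a boolean "success" field, plus metadata keys starting with an underscore (like "_relays"). The function name and the threshold parameter are made up as well.

```python
def failing_relays(results, threshold=10):
    """Return sorted "$fingerprint" strings for relays with >= threshold failures."""
    failing = []
    for fingerprint, attempts in results.items():
        if fingerprint.startswith("_"):
            continue  # skip metadata keys such as "_relays"
        failures = sum(1 for attempt in attempts if not attempt.get("success"))
        if failures >= threshold:
            failing.append("$" + fingerprint)
    return sorted(failing)
```

Lowering the threshold (say, to 8) would also list relays that fail most but not all attempts.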

Okay, a final note here: I created a script that pulls exit relays failing DNS queries at or above a certain threshold (by default only relays failing 10/10 times are shown) out of a JSON blob created either by Arthur's exit DNS check tool or by my own run. I contacted the respective exit relay ops last week (that's the "[s]" below, where "[sf]" means "mails sent but bounced") and did not really hear back (just one replied, saying they are looking into it). So, now's the time to actually start the badexiting process. I pushed a rule to mark all the exits below as badexit:

[s]                             "$296B2178FD742AB35AB20C9ADF04D5DFD3D407EB"
[s]                             "$3BADB3EFFB87534736BFAC9A2024AB78401BDBC3"
[no email address]              "$4684E03631097C77F013637EC800D499CD71C250"
[s]                             "$51AE5656C81CD417479253A6363A123A007A2233"
[no email address]              "$53FF368902D124FA9A806D149AF22C3A6357B150"
[s]                             "$5AD1D535373C05BB1624BD2A76DDE713E974240E"
[s]                             "$9AD12F0E3CC871D59ACA14BB4076CDD8CB28DE57"
[s]                             "$9C339F4F3101B744C8C040C9F51D63B520D38712"
[sf]                            "$9E9C2223EA179F52BA73A24BFDE2E44DCA468EEF"
[no email address]              "$A5B682E846615088362A3B2BD11C353C84778659"
[s]                             "$AD1639F47D6233E812A67F98F9D76FF55D1D2ECC"
[no email address]              "$F912C0A30DC9CBD4E7BA566C235DA194C4623EC0"
[s]                             "$FAD823A2AA7400D4A8107D7CD83050EEBB7A51FE"
[no email address]              "$FBBC3BD58B471F6227DC0F05265C6A37C770905F"
[s]                             "$FE59C12C9697E742CD3F7ADBAF6385EA1C8B379F"

(FWIW: As said previously, I slightly modified Arthur's script to use https://eff.org to check exits, as comparing those results with Arthur's allows us to differentiate between DNSSEC-only issues and more general ones. That's useful when contacting relay ops, in particular until #33179 (moved) is solved.)

Marking this ticket as needs_review for the script I want to add to the helper-scripts repo.

Trac:
Status: assigned to needs_review
Reviewer: N/A to dgoulet

Trac:
Actualpoints: N/A to 3.5
