Measure connectivity patterns between relays

added component::metrics/analysis network-health owner::metrics-team priority::medium severity::normal status::assigned type::project labels

Trac:
Description: https://lists.torproject.org/pipermail/tor-relays/2014-May/004598.html makes me wonder how many relays are firewalling certain outbound ports (and thus messing with connectivity inside the Tor network). It would be great if somebody would start scanning pairs of relays to see which of them can reach each other and which can't, with the goal of understanding how far from a clique our network topology actually is, and then helping with an awareness campaign to correct it if it's a problem.

Tools that might be helpful building blocks here:

Meejah's exitscanner builds circuits, and makes sure it isn't building too many at once. Uses txtorcon and thus twisted. https://github.com/meejah/txtorcon/blob/exit_scanner/apps/exit_scanner/guard-exit-coverage.py
phw's exitmap does something similar, but with stem rather than txtorcon. https://gitweb.torproject.org/user/phw/exitmap.git/tree

Other thoughts:

You likely want to turn on FastFirstHopPK on the client, so it doesn't waste CPU power on handshakes at the first relay.
If you make each relay connect to 6000 other relays in succession, and some of the relays can't handle 6000 open file descriptors at once, then you might misinterpret "could not extend to that relay" as a property of the link between the relays when actually it's a property of the first relay. One option is to scan 500 and then move on to another first hop. Another option is to declare this a feature, and try to detect which relays can and which can't handle 6000 open file descriptors at once.
n^2 where n is 5000 is actually a heck of a lot of circuits. Should you just build circuits forever in the background, or are there some smarter algorithms for finding interesting patterns without making all 25 million circuits? In particular, there will be a background failure rate anyway, from e.g. relays that happen to be overloaded at that moment. So even 25 million circuits won't be enough.
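
To make the "scan 500 and then move on to another first hop" option concrete, here is a minimal sketch of such a schedule. It builds no circuits; it only decides the order in which pairs would be tried. The function names and the batch_size parameter are made up for illustration, and the pair count uses n*(n-1) rather than n^2 because a relay is not paired with itself.

```python
from itertools import islice


def ordered_pairs_count(n):
    """Number of ordered (first hop, second hop) pairs among n relays."""
    return n * (n - 1)


def _targets(relays, first):
    """Lazily yield every relay except the given first hop."""
    for relay in relays:
        if relay != first:
            yield relay


def batched_schedule(relays, batch_size=500):
    """Yield (first_hop, second_hop) pairs, at most batch_size per first hop
    before rotating to the next first hop, until all ordered pairs are covered.

    batch_size=500 follows the suggestion in the ticket; it is not a tuned value.
    """
    # One lazy iterator of not-yet-scheduled targets per first hop.
    pending = {first: _targets(relays, first) for first in relays}
    while pending:
        for first in list(pending):
            batch = list(islice(pending[first], batch_size))
            if not batch:
                # This first hop has been paired with every other relay.
                del pending[first]
                continue
            for second in batch:
                yield first, second


if __name__ == "__main__":
    print(ordered_pairs_count(5000))
    for pair in islice(batched_schedule(["A", "B", "C", "D"], batch_size=2), 6):
        print(pair)
```

A real scanner would also need to limit how many circuits are in flight at once, which this sketch leaves out.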

I would try to get mikeperry's input on this. I know we spent a little back-and-forth while I was sprucing up exitscanner for his use in Something Meejah Can't Recall, and the definition of "failure" was an issue I do remember consuming a lot of typing ;)

The original use-case for that txtorcon-based exit_scanner stuff was to answer questions about the background failure rate of circuits, surrounding the wider question of "is my relay failing Too Many circuits?"

It also seems to me worthwhile to brainstorm some way to reduce the 25M edges. For example, "real" clients will always pick a Guard as the first hop, so does it really matter if non-Guard-A can see Guard-A? (It seems to me it only matters the other way around.) If all potential guards can see all potential middles, and all potential middles can see all potential exits, the network is good, right? This is still probably too many to reasonably scan, but then that set can be partitioned with weights similar to whatever Tor would do, so that you're more likely to scan connections that are more likely to be used. "or something".
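
A rough sketch of that idea, under loud assumptions: relays are given as a plain dict with a set of flags and a bandwidth number (a hypothetical structure, not any library's format), every relay is treated as a potential middle, and the sampling weight is simply the product of the two bandwidths. That weight is a crude stand-in, not Tor's actual path selection.

```python
import random


def role_pairs(relays):
    """Return the ordered pairs that matter for guard -> middle -> exit paths.

    relays maps a relay name to {"flags": set of str, "bandwidth": number}.
    Assumption: any relay may act as a middle.
    """
    guards = [name for name, info in relays.items() if "Guard" in info["flags"]]
    exits = [name for name, info in relays.items() if "Exit" in info["flags"]]
    middles = list(relays)
    pairs = set()
    for guard in guards:
        for middle in middles:
            if guard != middle:
                pairs.add((guard, middle))
    for middle in middles:
        for exit_relay in exits:
            if middle != exit_relay:
                pairs.add((middle, exit_relay))
    return pairs


def sample_pairs(pairs, relays, k, rng=random):
    """Draw k pairs (with replacement), favouring high-bandwidth pairs."""
    ordered = sorted(pairs)
    # Illustrative weight only: product of the two relays' bandwidths.
    weights = [relays[a]["bandwidth"] * relays[b]["bandwidth"] for a, b in ordered]
    return rng.choices(ordered, weights=weights, k=k)


if __name__ == "__main__":
    demo = {
        "g1": {"flags": {"Guard"}, "bandwidth": 100},
        "m1": {"flags": set(), "bandwidth": 10},
        "m2": {"flags": set(), "bandwidth": 20},
        "e1": {"flags": {"Exit"}, "bandwidth": 50},
    }
    wanted = role_pairs(demo)
    print(len(wanted), "of", len(demo) * (len(demo) - 1), "ordered pairs")
    print(sample_pairs(wanted, demo, k=3, rng=random.Random(0)))
```

Letting Tor itself choose the paths, as one of the scanners apparently did, would avoid having to approximate the weights at all.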

We did put some work into one of the scanners to let Tor do that choosing as much as possible, I believe...

As a structural note: if anyone wants to take that exit-scanner stuff and run with it, I'd recommend putting it in a new repository that depends on txtorcon as a library -- that "apps/*" directory was just where I happened to shove it since it didn't feel like a "full blown stand-alone app" quite yet. Please let me know if you do this, and I'll delete that branch and point people to the New Thing.

Trac:
Cc: meejah, phw, atagar to meejah, phw, atagar, r.a@posteo.net

I wrote some code to gather the data required. It shares some ideas from my tor-rtt code and is available online: https://bitbucket.org/ra_/tor-relay-connectivity/

It takes the network status as of the moment the script starts and builds circuits in parallel, while trying to avoid hammering any single node. The output is CSV in the format: relay1,relay2,reason,remote_reason. In this case we are looking specifically for remote_reason "CONNECTFAILED".
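
For anyone who wants to poke at the output, a minimal sketch of reading that format and counting CONNECTFAILED results per first relay could look like this. It assumes the file has no header row and exactly four columns; the sample rows and the short relay IDs in them are invented for illustration, not taken from the real data.

```python
import csv
import io
from collections import Counter


def connect_failures(csv_text):
    """Count rows with remote_reason CONNECTFAILED, keyed by relay1.

    Expects lines of the form relay1,relay2,reason,remote_reason.
    """
    failures = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        relay1, relay2, reason, remote_reason = row
        if remote_reason == "CONNECTFAILED":
            failures[relay1] += 1
    return failures


if __name__ == "__main__":
    # Invented sample rows; real rows would carry full relay fingerprints.
    sample = (
        "AAAA,BBBB,DESTROYED,CONNECTFAILED\n"
        "AAAA,CCCC,DESTROYED,CONNECTFAILED\n"
        "BBBB,CCCC,DESTROYED,CHANNEL_CLOSED\n"
    )
    for relay, count in connect_failures(sample).most_common():
        print(relay, count)
```

Counting by relay2 instead would give the inbound view of the same data.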

Trac:
test.csv

The attached file shows all failed circuit builds from a test run of 60k circuits, which works out to about 12 connections per relay. I have not looked into the details yet, but some nodes already appear to have outbound connection problems.

Updated CSV files can be found in the data directory.

Trac:
Cc: meejah, phw, atagar, r.a@posteo.net to meejah, phw, atagar, r.a@posteo.net, gk

ra, could you do a second run so we get an idea of what could be some temporary overloading?

Seems wise to put timestamps on your entries, so we know when what happened.

I stopped the first run at about 2% of all relay pairs and updated the data. The second run is already in progress and will include timestamps. Moreover, I updated the Tor client used to the 0.2.4 series which means that it will provide better circuit build error messages. The second run makes use of more threads so that it will complete almost 10% of all relay pairs per day. I will upload a snapshot of the new data in a few hours.

Interconnectivity between 6730 relays overall was tested during the last 3+ weeks. The connection between any two relays was tested until it could be successfully established - up to 6 times.
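
As a back-of-the-envelope check on why retrying helps against the background failure rate, the sketch below estimates how many pairs would fail every attempt purely by chance. It assumes attempts fail independently with one fixed probability, which is optimistic (overload is likely correlated in time), and the failure rates used are hypothetical, not measured.

```python
def expected_chance_failures(num_pairs, background_failure_rate, attempts):
    """Expected number of pairs whose every attempt fails by chance alone,
    assuming independent attempts with the same failure probability."""
    return num_pairs * background_failure_rate ** attempts


if __name__ == "__main__":
    pairs = 6730 * 6729  # ordered pairs among the 6730 relays tested
    for rate in (0.01, 0.05, 0.2):  # hypothetical background failure rates
        print(rate, expected_chance_failures(pairs, rate, attempts=6))
```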

I uploaded the raw measurement results gathered to Bitbucket: https://bitbucket.org/ra_/tor-relay-connectivity/downloads

Most of the analysis is already done and I will post the results as soon as the last measurement run is finished.

IDHEX: relay's ID.
GOODCONNECTIONS: number of relays to which a connection could be successfully established.
SUSPICIOUSCONNECTIONS: number of relays to which no connection could be established (one or two unsuccessful attempts).
BADCONNECTIONS: number of relays to which no connection could be established (three to six unsuccessful attempts).
CONNECTIONTESTS: total number of connections tested.
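
The column definitions above can be sketched as a small classifier. This is one reading of them, not the actual evaluation code: it assumes a pair counts as good as soon as any attempt succeeded, and that the input is a list of (idhex, succeeded, failed_attempts) tuples, which is a made-up structure for illustration.

```python
from collections import Counter


def classify(succeeded, failed_attempts):
    """Label one tested connection according to the column definitions."""
    if succeeded:
        return "GOOD"
    if failed_attempts == 0:
        return None  # never actually tested
    if failed_attempts <= 2:
        return "SUSPICIOUS"
    return "BAD"  # three to six unsuccessful attempts


def summarize(results):
    """Build per-relay counts from (idhex, succeeded, failed_attempts) tuples."""
    table = {}
    for idhex, succeeded, failed_attempts in results:
        label = classify(succeeded, failed_attempts)
        if label is None:
            continue
        row = table.setdefault(idhex, Counter())
        row[label + "CONNECTIONS"] += 1
        row["CONNECTIONTESTS"] += 1
    return table


if __name__ == "__main__":
    # Invented results for two placeholder relay IDs.
    demo = [("A", True, 0), ("A", True, 2), ("A", False, 1),
            ("A", False, 6), ("B", False, 3)]
    for idhex, row in summarize(demo).items():
        print(idhex, dict(row))
```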

I have not analyzed inbound connections yet, but some relays seem to have serious outbound connection issues. Interestingly, the reason is mostly CHANNEL_CLOSED and not CONNECTFAILED.

Connection issues for relays seem to be either inbound or outbound, but not both.

I forgot to push the final source code and evaluation.

Trac:
bad_inbound_connections.csv

Trac:
bad_outbound_connections.csv

See also #19068 (moved) for an overlapping ticket.

Trac:
Severity: N/A to Normal
Reviewer: N/A to N/A
Sponsor: N/A to N/A
