Opened 6 years ago

Last modified 10 months ago

#12131 assigned project

Measure connectivity patterns between relays

Reported by: arma
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Analysis
Version:
Severity: Normal
Keywords: network-health
Cc: meejah, phw, atagar, r.a@…, gk, catalyst
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description (last modified by arma)

makes me wonder how many relays are firewalling certain outbound ports (and thus messing with connectivity inside the Tor network). It would be great if somebody would start scanning pairs of relays to see which of them can reach each other and which can't, with the goal of understanding how far from a clique our network topology actually is, and then helping with an awareness campaign to correct it if it's a problem.

Tools that might be helpful building blocks here:

Other thoughts:

  • You likely want to turn on FastFirstHopPK on the client, so it doesn't waste CPU power on handshakes at the first relay.
  • If you make each relay connect to 6000 other relays in succession, and some of the relays can't handle 6000 open file descriptors at once, then you might misinterpret "could not extend to that relay" as a property of the link between the relays when actually it's a property of the first relay. One option is to scan 500 and then move on to another first hop. Another option is to declare this a feature, and try to detect which relays can and which can't handle 6000 open file descriptors at once.
  • n^2 where n is 5000 is actually a heck of a lot of circuits. Should you just build circuits forever in the background, or are there some smarter algorithms for finding interesting patterns without making all 25 million circuits? In particular, there will be a background failure rate anyway, from e.g. relays that happen to be overloaded at that moment. So even 25 million circuits won't be enough.
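To make the "scan 500 and then move on to another first hop" idea concrete, here is a minimal sketch of a scheduler that hands out ordered relay pairs in small batches per first hop. The function name `schedule` and the `batch_size` parameter are hypothetical, and the relay list is assumed to be a plain list of identifiers; this only plans the order of tests and does not talk to Tor.

```python
from typing import Iterator, List, Tuple


def schedule(relays: List[str], batch_size: int = 500) -> Iterator[Tuple[str, List[str]]]:
    """Yield (first_hop, targets) batches covering every ordered relay pair.

    Each first hop gets at most `batch_size` targets before the scheduler
    moves on to the next first hop, so no relay is asked to hold thousands
    of connections open at once.
    """
    n = len(relays)
    offset = 0  # number of targets already handed out for every first hop
    while offset < n - 1:
        for i, first_hop in enumerate(relays):
            # All other relays, rotated so different first hops start at
            # different targets instead of all hitting the same relay first.
            others = relays[i + 1:] + relays[:i]
            batch = others[offset:offset + batch_size]
            if batch:
                yield first_hop, batch
        offset += batch_size


def ordered_pair_count(n: int) -> int:
    """Number of ordered (first hop, second hop) pairs among n relays."""
    return n * (n - 1)
```

With n = 5000 this plans `ordered_pair_count(5000)` = 24,995,000 tests, i.e. roughly the 25 million circuits mentioned above, in 500-target slices per first hop.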

Child Tickets

Attachments (3)

test.csv (14.4 KB) - added by ra 6 years ago.
bad_inbound_connections.csv (375.1 KB) - added by ra 6 years ago.
bad_outbound_connections.csv (370.2 KB) - added by ra 6 years ago.


Change History (22)

comment:1 Changed 6 years ago by arma

Description: modified (diff)

comment:2 Changed 6 years ago by meejah

I would try to get mikeperry's input on this. I know we spent a little back-and-forth while I was sprucing up exitscanner for his use in Something Meejah Can't Recall, and the definition of "failure" was an issue I *do* remember consuming a lot of typing ;)

The original use-case for that txtorcon-based exit_scanner stuff was to answer questions about the background failure rate of circuits, surrounding the wider question of "is my relay failing Too Many circuits?"

It also seems to me worthwhile brainstorming some way to reduce the 25M edges... For example, "real" clients will always pick a Guard as the first hop, so does it really matter if non-Guard-A can see Guard-A? (It seems to me it only matters the other way around.) If all potential guards can see all potential middles, and all potential middles can see all potential exits, the network is good, right? This is still probably too many to reasonably scan... but then that set can be partitioned with weights similar to whatever Tor would do so that you're more likely to scan connections that are more likely to be used. "or something".
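A rough sketch of that reduction, under loudly stated assumptions: the `Relay` class, its fields, and both helper functions are hypothetical; every relay is treated as a potential middle; and the product of the two bandwidths is a crude stand-in for Tor's real path-selection weights, which are more involved.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Relay:
    """Hypothetical minimal view of a relay taken from a network status."""
    fingerprint: str
    bandwidth: int
    flags: FrozenSet[str] = frozenset()


Edge = Tuple[Relay, Relay]


def candidate_edges(relays: List[Relay]) -> List[Edge]:
    """Keep only guard->middle and middle->exit edges (any relay may be a middle)."""
    guards = [r for r in relays if "Guard" in r.flags]
    exits = [r for r in relays if "Exit" in r.flags]
    edges = [(g, m) for g in guards for m in relays if g != m]
    edges += [(m, e) for m in relays for e in exits if m != e]
    # A guard->exit edge appears in both lists; keep one copy, preserve order.
    return list(dict.fromkeys(edges))


def sample_edges(edges: List[Edge], k: int, rng: random.Random) -> List[Edge]:
    """Draw k edges (with replacement), favouring high-bandwidth endpoints."""
    weights = [a.bandwidth * b.bandwidth for a, b in edges]
    return rng.choices(edges, weights=weights, k=k)
```

Sampling with replacement means popular edges get retested, which also gives some handle on the background failure rate for the links that matter most.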

We did put some work into one of the scanners to let Tor do that choosing as much as possible, I believe...

As a structural note: if anyone wants to take that exit-scanner stuff and run with it, I'd recommend putting it in a new repository that depends on txtorcon as a library -- that "apps/*" directory was just where I happened to shove it since it didn't feel like a "full blown stand-alone app" quite yet. Please let me know if you do this, and I'll delete that branch and point people to the New Thing.

comment:3 Changed 6 years ago by ra

Cc: r.a@… added

comment:4 Changed 6 years ago by ra

I wrote some code to gather the data required. It shares some ideas from my tor-rtt code and is available online:

It just takes the network-status from the point in time when starting the script and builds circuits in parallel while aiming to avoid hammering single nodes. The output is CSV in the format: relay1,relay2,reason,remote_reason. In this case we are looking specifically for remote_reason "CONNECTFAILED".
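For anyone wanting to poke at the output, a minimal sketch of filtering that CSV for the interesting rows follows. The column order comes from the format above; the sample rows are made up for illustration, and treating relay1 as the first hop (the side making the outbound connection) is an assumption.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, Tuple

FIELDS = ["relay1", "relay2", "reason", "remote_reason"]

# Made-up sample rows, NOT real measurement data.
SAMPLE = (
    "AAAA,BBBB,DESTROYED,CONNECTFAILED\n"
    "AAAA,CCCC,TIMEOUT,\n"
    "DDDD,BBBB,DESTROYED,CHANNEL_CLOSED\n"
    "AAAA,DDDD,DESTROYED,CONNECTFAILED\n"
)


def connect_failures(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (relay1, relay2) for every row whose remote_reason is CONNECTFAILED."""
    reader = csv.DictReader(lines, fieldnames=FIELDS)
    return [
        (row["relay1"], row["relay2"])
        for row in reader
        if row["remote_reason"] == "CONNECTFAILED"
    ]


def failures_per_first_hop(pairs: List[Tuple[str, str]]) -> Counter:
    """Count failures per relay1 (assumed here to be the first hop)."""
    return Counter(first for first, _second in pairs)


failed = connect_failures(io.StringIO(SAMPLE))
outbound = failures_per_first_hop(failed)
```

A relay that shows up often in `outbound` is a candidate for an outbound firewall problem, though a single run cannot separate that from temporary overload.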

Changed 6 years ago by ra

Attachment: test.csv added

comment:5 Changed 6 years ago by ra

The attached file shows all failed circuit builds from a test run of 60k circuits, i.e. about 12 connections for each relay. Without having looked into the details, some nodes already appear to have outbound connection problems.

comment:6 Changed 6 years ago by ra

Updated CSV files can be found in the data directory.

comment:7 Changed 6 years ago by gk

Cc: gk added

comment:8 Changed 6 years ago by lunar

ra, could you do a second run so we get an idea of what could be some temporary overloading?

comment:9 Changed 6 years ago by arma

Seems wise to put timestamps on your entries, so we know when what happened.

comment:10 Changed 6 years ago by ra

I stopped the first run at about 2% of all relay pairs and updated the data. The second run is already in progress and will include timestamps. Moreover, I updated the Tor client used to the 0.2.4 series which means that it will provide better circuit build error messages. The second run makes use of more threads so that it will complete almost 10% of all relay pairs per day. I will upload a snapshot of the new data in a few hours.

comment:11 Changed 6 years ago by ra

Interconnectivity between 6730 relays overall was tested during the last 3+ weeks. The connection between any two relays was tested until it could be successfully established - up to 6 times.

I uploaded the raw measurement results gathered to Bitbucket:

Most of the analysis is already done and I will post the results as soon as the last measurement run is finished.

comment:12 Changed 6 years ago by ra

IDHEX: relay's ID.
GOODCONNECTIONS: number of relays to which a connection could be successfully established.
SUSPICIOUSCONNECTIONS: number of relays to which no connection could be established (one or two unsuccessful attempts).
BADCONNECTIONS: number of relays to which no connection could be established (three to six unsuccessful attempts).
CONNECTIONTESTS: total number of connections tested.
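As an illustration of how these columns relate to each other, here is a small sketch that derives them from per-pair attempt records. The input layout (a dict mapping `(idhex, target)` to a list of booleans, one per attempt) and both function names are hypothetical, and counting CONNECTIONTESTS as the number of tested pairs rather than the number of attempts is an assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def classify(attempts: List[bool]) -> str:
    """Map one pair's attempt history onto a column name."""
    if not attempts:
        raise ValueError("pair was never tested")
    if any(attempts):
        return "GOODCONNECTIONS"  # at least one attempt succeeded
    if len(attempts) <= 2:
        return "SUSPICIOUSCONNECTIONS"  # one or two failures, no success
    return "BADCONNECTIONS"  # three to six failures, no success


def summarize(results: Dict[Tuple[str, str], List[bool]]) -> Dict[str, Counter]:
    """Aggregate per-pair results into per-relay (IDHEX) counters."""
    summary: Dict[str, Counter] = defaultdict(Counter)
    for (idhex, _target), attempts in results.items():
        summary[idhex][classify(attempts)] += 1
        summary[idhex]["CONNECTIONTESTS"] += 1  # assumed: one test per pair
    return dict(summary)
```

Keying on the first element of the pair gives the outbound view; swapping the two elements of the key would give the inbound view.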

comment:13 Changed 6 years ago by ra

Not having analyzed inbound connections, it seems that some relays have serious outbound connection issues. Interestingly, the reason is mostly CHANNEL_CLOSED and not CONNECTFAILED.

comment:14 Changed 6 years ago by ra

Connection issues for relays seem to be either inbound or outbound, but not both.

comment:15 Changed 6 years ago by ra

I forgot to push the final source code and evaluation.

Changed 6 years ago by ra

Attachment: bad_inbound_connections.csv added

Changed 6 years ago by ra

Attachment: bad_outbound_connections.csv added

comment:16 Changed 3 years ago by arma

Severity: Normal

See also #19068 for an overlapping ticket.

comment:17 Changed 3 years ago by catalyst

Cc: catalyst added

comment:18 Changed 3 years ago by karsten

Owner: set to metrics-team
Status: new → assigned

comment:19 Changed 10 months ago by gk

Keywords: network-health added