TorDNSel exit lists are missing expected data

added TorCheck component::core tor/tordnsel priority::medium reporter::Ry severity::normal status::new type::defect labels

TorCheck had issues in the past months (https://blog.torproject.org/blog/tor-check-outage-03-and-04-july-2013), so maybe it's related to that. Can you re-run your analysis on the archives of, say, all of 2013? It would be interesting to know whether this is a new problem or not.

Trac:
Priority: major to normal
Status: new to assigned
Owner: phobos to N/A

tordnsel runs a standard tor client, if the tor client isn't seeing the whole consensus, then we have far larger issues.

Cc'ing tup as the original tordnsel developer, and Lunar as somebody who's looked at it more recently.

I guess step zero is to confirm or deny phobos's theory above that the Tor client it's using somehow doesn't inform it about all the relays.

Trac:
Cc: N/A to tup, lunar

Trac:
Cc: tup, lunar to tup, lunar, arlo@torproject.org

Replying to karsten:

TorCheck had issues in the past months (https://blog.torproject.org/blog/tor-check-outage-03-and-04-july-2013), so maybe it's related to that. Can you re-run your analysis on the archives of, say, all of 2013? It would be interesting to know whether this is a new problem or not. AFAIK that was due to implementation details of the older(still current) TorCheck version which wasn't reliant on published exit-lists as much. I can surely check across 2013 at the weekend perhaps (Don't really have the connection to actually download that much data right this moment)

As for the Tor client not seeing the whole consensus, I wasn't exactly sure what you were suggesting, so I did two things. For my Tor client (built off master), I ran two scripts:

https://gist.github.com/Ryman/ee2c11e7107a81c926c8 Comparing the cached-consensus with the latest consensus on the metrics server that I used for the OP. There have been no differences for the last two hours for which I tried. (If you need to test, there's a comment with a helpful rsync command that will pull the latest consensus doc, if you change the date/hour)
https://gist.github.com/Ryman/1f2e76d7465abb7ea603 Comparing the fingerprints present in the cached consensus document and a merge of both cached-descriptors and cached-descriptors.new. In this comparison I have found a really tiny number of servers are missing (3 when I started testing, seemed to stick at 2 for an hour throughout 2 different consensus files) If you don't want to compare against .new files then supply a -s flag, but I think this would be an error to do so?

From this, I conclude that the client likely sees enough of the consensus, so that the discrepancy in the OP is not caused by that specifically.

It's likely worth running the scripts against the version of Tor client being used by TorDNSEL in production just as a sanity check. I think arma might be on the right lines in that it's potentially a problem between TorDNSEL and Tor, or TorDNSEL is perhaps timing out connections and never retrying/logging them?

Trac:
Username: Ry

Some speculation:

TorDNSEL's conf currently looks like,

TestDestinationAddress 38.229.72.22:8080,8443,110,5190,6667,6697,9030

I wonder how many of the above relays allow exiting but just not to those ports or that IP? From the false negative work we've been doing on check, there're at least two that can only exit on 443 and will never be picked up. What's the best set of ports to run on? Is this going to account for half the exits?

The data used in the above investigation was from the metrics project. I noticed the cron that collects the data runs at around 2 minutes past the hour but TorDNSEL is busy collecting new data for what looks like 20 minutes. Is it getting everything in exit-addresses.new? Should probably rerun the tests with a file straight from TorDNSEL in production to confirm the above.

Replying to arlolra:

The data used in the above investigation was from the metrics project. I noticed the cron that collects the data runs at around 2 minutes past the hour but TorDNSEL is busy collecting new data for what looks like 20 minutes. Is it getting everything in exit-addresses.new? Should probably rerun the tests with a file straight from TorDNSEL in production to confirm the above.

That's correct, the cronjob runs at 2 minutes past the hour. If there's a better time to fetch the time, I can change that to any minute past the hour. I ran a quick experiment where I downloaded the file every minute today between 11:04 and 12:09 UTC. Here are the file sizes and last-modified times:

116408 11:04
116408 11:05
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116408 11:06
116721 11:29
116721 11:30
116721 11:31
117034 11:32
117034 11:33
117034 11:34
117034 11:35
117191 11:36
117191 11:37
117350 11:38
117350 11:39
117351 11:40
117509 11:41
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117509 11:42
117666 12:02
117822 12:03
117822 12:04
117822 12:05
117822 12:06
117822 12:06
117822 12:06
117822 12:06

It seems that the :06 and :42 files are left unchanged for long enough to fetch them. So, should I change the cronjob to either 25 or 55 minutes past the hour?

Or should I learn more about the exit-addresses.new file and maybe fetch that one, too?

Set all open tickets without a severity to "Normal"

Trac:
Severity: N/A to Normal

Mark all tickets that are assigned to nobody as "new".

Trac:
Status: assigned to new

moved to tpo/core/tordnsel#9765

TorDNSel exit lists are missing expected data

Child items ...

Activity