Opened 7 years ago

Closed 7 years ago

#6266 closed defect (fixed)

maxmind geoip db is starting to label Tor relays as country "A1"

Reported by: arma
Owned by:
Priority: Medium
Milestone: Tor: 0.2.3.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: tor-client
Cc: karsten
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

The more A1 relays we have, the more useless it is to think that setting EntryNodes or ExitNodes by country works. That is especially true for people who are trying to Exclude countries.

Child Tickets

Ticket  Status  Owner  Summary                               Component
#6438   closed         Evaluate software77's geoip database  Core Tor/Tor

Attachments (2)

geoipclean.py (1.6 KB) - added by nickm 7 years ago.
deanonymind.py (3.6 KB) - added by karsten 7 years ago.


Change History (20)

comment:1 Changed 7 years ago by arma

Is this something that we can isolate in the db and just remove those lines? That is, if we find a line

212122048,212491249,US
212491250,212491250,A1
212491251,212733527,US

we can likely just merge those three lines into one.

But cases like

531334528,531334655,US
531334656,531335167,A1
531335168,531337215,IT

are not so easy.

So the first question is, are the Tor relays that they're adding only single IP blocks?
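The two cases above can be sketched in Python as follows. This is only an illustration, not the attached scripts: the function name `resolve_simple_a1` and the `(start, end, country)` tuple layout are assumptions, and so is the extra requirement that the neighbouring ranges be directly adjacent to the A1 range.

```python
def resolve_simple_a1(entries):
    """Relabel A1 ranges whose previous and next ranges agree on a country,
    then merge directly adjacent ranges that share a country code.

    entries: sorted list of (start, end, country) tuples (assumed layout).
    """
    relabeled = []
    for i, (start, end, cc) in enumerate(entries):
        if cc == "A1" and 0 < i < len(entries) - 1:
            _, prev_end, prev_cc = entries[i - 1]
            next_start, _, next_cc = entries[i + 1]
            # Assumption: only touch A1 ranges with no gap on either side.
            contiguous = prev_end + 1 == start and end + 1 == next_start
            if contiguous and prev_cc == next_cc and prev_cc != "A1":
                cc = prev_cc
        relabeled.append((start, end, cc))

    merged = []
    for start, end, cc in relabeled:
        if merged and merged[-1][2] == cc and merged[-1][1] + 1 == start:
            merged[-1] = (merged[-1][0], end, cc)  # extend the previous range
        else:
            merged.append((start, end, cc))
    return merged


easy = [(212122048, 212491249, "US"),
        (212491250, 212491250, "A1"),
        (212491251, 212733527, "US")]
hard = [(531334528, 531334655, "US"),
        (531334656, 531335167, "A1"),
        (531335168, 531337215, "IT")]

print(resolve_simple_a1(easy))  # the three US/A1/US lines collapse into one
print(resolve_simple_a1(hard))  # left unchanged: US and IT disagree
```

The second example is returned untouched, which is exactly the "not so easy" case that needs extra information.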

comment:2 Changed 7 years ago by arma

Cc: karsten added

comment:3 Changed 7 years ago by karsten

> The more A1 relays we have, the more useless it is to think that setting
> EntryNodes or ExitNodes by country works. Especially for people who are
> trying to Exclude countries.

True. The June database contains 341 "A1" entries, compared to 96 such entries in April and May and even fewer before. We'll have to do something about that.

> So the first question is, are the Tor relays that they're adding only
> single IP blocks?

I looked up all relays in the consensus from 2012-07-01 00:00:00 UTC in the June database. Here are the matches:

from_addr,to_addr,country num_addr relay_addr
773032192,773032447,A1 256 46.19.137.238
782665472,782666751,A1 1280 46.166.135.90
1249381764,1249381764,A1 1 74.120.13.132
1308079522,1308079525,A1 4 77.247.181.162
1308079522,1308079525,A1 4 77.247.181.163
1308079522,1308079525,A1 4 77.247.181.164
1308079522,1308079525,A1 4 77.247.181.165
1357767240,1357769287,A1 2048 80.237.226.73
1357767240,1357769287,A1 2048 80.237.226.74
1357767240,1357769287,A1 2048 80.237.226.75
1357767240,1357769287,A1 2048 80.237.226.76
1388570842,1388570842,A1 1 82.195.232.218
1839458760,1839458760,A1 1 109.163.233.200
1999278098,1999278098,A1 1 119.42.144.18
2302190382,2302190382,A1 1 137.56.163.46
2461603763,2461603763,A1 1 146.185.23.179
2919159874,2919159877,A1 4 173.254.216.66
2919159874,2919159877,A1 4 173.254.216.67
2919159874,2919159877,A1 4 173.254.216.68
2919159874,2919159877,A1 4 173.254.216.69
2987528448,2987528703,A1 256 178.18.17.111
2988550435,2988550435,A1 1 178.33.169.35
2991183872,2991185919,A1 2048 178.73.216.197
3341849376,3341853471,A1 4096 199.48.147.35
3341849376,3341853471,A1 4096 199.48.147.36
3341849376,3341853471,A1 4096 199.48.147.37
3341849376,3341853471,A1 4096 199.48.147.38
3341849376,3341853471,A1 4096 199.48.147.39
3341849376,3341853471,A1 4096 199.48.147.40
3341849376,3341853471,A1 4096 199.48.147.41
3392946720,3392946720,A1 1 202.60.66.32
3584245803,3584245803,A1 1 213.163.64.43
3584246066,3584246066,A1 1 213.163.65.50

The 12 ranges covering 4 addresses or fewer look like ones we could repair by merging them with the previous and/or next range. The 6 ranges with more than 4 addresses might be more difficult, at least without using additional information.
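A lookup like the one behind the table above could be sketched as follows. This is a minimal illustration, not the code that produced the table; the `lookup` helper and the tuple layout are assumptions, and the range list must be sorted by start address.

```python
import bisect
import ipaddress


def lookup(ranges, addr):
    """Return (from_addr, to_addr, country, num_addr) for addr, or None.

    ranges: list of (from_addr, to_addr, country) sorted by from_addr,
    with addresses stored as integers as in the GeoIP file.
    """
    value = int(ipaddress.IPv4Address(addr))
    starts = [r[0] for r in ranges]
    i = bisect.bisect_right(starts, value) - 1  # last range starting <= value
    if i >= 0 and ranges[i][0] <= value <= ranges[i][1]:
        start, end, cc = ranges[i]
        return start, end, cc, end - start + 1
    return None


# Three of the A1 ranges listed in the table above.
ranges = [(1249381764, 1249381764, "A1"),
          (1308079522, 1308079525, "A1"),
          (3341849376, 3341853471, "A1")]

print(lookup(ranges, "74.120.13.132"))
print(lookup(ranges, "199.48.147.35"))
```

The `num_addr` column is simply `to_addr - from_addr + 1`, which is what separates single-address entries from larger blocks.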

What we could also do is look up the "A1" ranges in a second GeoIP database and use those entries. We'd have to do it in a way that is easily reproducible for others.

comment:4 Changed 7 years ago by nickm_mobile

One short-term idea is to traceroute suspect addresses, and put each of them in whatever country is the closest one in the traceroute that has a country.
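As a rough sketch of that idea, assuming the traceroute hops have already been mapped to country codes by some other means (the function name and the input format are hypothetical, and "closest" is read here as closest to the target):

```python
def country_from_traceroute(hop_countries):
    """Pick the country of the hop nearest the target that has a usable code.

    hop_countries: country codes of the hops, ordered from the measuring
    host towards the target; None or "A1" means no usable country.
    """
    for cc in reversed(hop_countries):
        if cc and cc != "A1":
            return cc
    return None


print(country_from_traceroute(["DE", "DE", "NL", None, "A1"]))
```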

Longer-term, let's put node countries in the consensus or microdescs?

comment:5 Changed 7 years ago by nickm

The "look for cases where the previous and next entry are in the same country" rule resolves 90% of the A1 entries in the June maxmind db.

Before applying that rule, we put 0.57% of all routers in A1. After that rule, we put only 0.07% of the routers (that is to say, 2 of them) in the A1 region.

Attaching the toy script I wrote.

Changed 7 years ago by nickm

Attachment: geoipclean.py added

comment:6 Changed 7 years ago by nickm

Keywords: tor-client added

comment:7 Changed 7 years ago by nickm

Component: changed from Tor Client to Tor

Changed 7 years ago by karsten

Attachment: deanonymind.py added

comment:8 in reply to:  5 Changed 7 years ago by karsten

Status: changed from new to needs_review

Replying to nickm:

> The "look for cases where the previous and next entry are in the same country" rule resolves 90% of the A1 entries in the June maxmind db.

After looking into using Software77's database or the RIR delegation files as a replacement, I like the approach you suggest here best. I think we should resolve those 90% of A1 entries automatically and have a human fix the remaining 10% by using RIR delegation files as a reference and using common sense. If we document what changes we made and make it easy for others to verify our decisions, I think we should be all set.

I wrote a script to fix the simple cases in MaxMind's database, and I extended blockfinder to show differences between GeoIP databases. Here's what I did to fix the 90%+10% of A1 entries. I wrote this down as documentation that we can ship together with the geoip file for others to verify what we did.

Clone blockfinder:

git clone https://github.com/ioerror/blockfinder
cd blockfinder/

Download MaxMind GeoLite Country database file

wget http://geolite.maxmind.com/download/geoip/database/GeoIPCountryCSV.zip

Download and run deanonymind.py to automatically replace A1 entries with the country code of the previous and next entry if the two agree.

python deanonymind.py GeoIPCountryCSV.zip AutomaticGeoIPCountryWhois.csv

Unzip the original MaxMind file and compare it to the new file.

unzip GeoIPCountryCSV.zip
diff -U1 GeoIPCountryWhois.csv AutomaticGeoIPCountryWhois.csv | less

Copy new file to have a reference for manual changes.

cp AutomaticGeoIPCountryWhois.csv ManualGeoIPCountryWhois.csv

Initialize cache with RIR delegation files, MaxMind's original file, and the modified file:

python blockfinder -i
python blockfinder -r GeoIPCountryWhois.csv
python blockfinder -r ManualGeoIPCountryWhois.csv

Run blockfinder to compare the three data sources for the A1 country code.

python blockfinder -p A1 | less

Scroll down to "Assignments in 'ManualGeoIPCountryWhois.csv'". The blocks shown there are the A1 entries that could not be resolved by deanonymind.py, most likely because previous and subsequent country codes do not match. There are 19 such entries in the November 2012 database, so within scope for a human to fix. Here's an example:

  NL 31.171.128.0-31.171.133.255 GeoIPCountryWhois.csv
> A1 31.171.134.0-31.171.135.255 GeoIPCountryWhois.csv
  IT 31.171.136.0-31.171.143.255 GeoIPCountryWhois.csv
  NL 31.171.128.0-31.171.133.255 ManualGeoIPCountryWhois.csv
< A1 31.171.134.0-31.171.135.255 ManualGeoIPCountryWhois.csv
  IT 31.171.136.0-31.171.143.255 ManualGeoIPCountryWhois.csv
* NL 31.171.128.0-31.171.135.255 rir
  IT 31.171.136.0-31.171.143.255 rir

In this case the two MaxMind files still agree that 31.171.134.0-31.171.135.255 should be assigned to A1 whereas the RIR delegation files say NL. It seems clear that NL is correct here, so we can manually change this line in ManualGeoIPCountryWhois.csv to NL. Repeat 18 times for the remaining A1 entries.
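The check made by eye in this example, whether a single RIR range fully contains the A1 range, can be sketched like this. The helper names are assumptions and this is not part of blockfinder:

```python
import ipaddress


def to_int(addr):
    """Convert a dotted-quad IPv4 address to an integer."""
    return int(ipaddress.IPv4Address(addr))


def rir_suggestion(a1_range, rir_ranges):
    """Return the RIR country code if one RIR range fully contains a1_range."""
    a1_start, a1_end = (to_int(a) for a in a1_range)
    for start, end, cc in rir_ranges:
        if to_int(start) <= a1_start and a1_end <= to_int(end):
            return cc
    return None  # no single RIR range covers the whole A1 range


# RIR assignments taken from the blockfinder output above.
rir = [("31.171.128.0", "31.171.135.255", "NL"),
       ("31.171.136.0", "31.171.143.255", "IT")]

print(rir_suggestion(("31.171.134.0", "31.171.135.255"), rir))
```

A suggestion like this is only a hint for the human doing the edit; the CA versus US example further down shows that the RIR files are not always the better source.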

Re-import ManualGeoIPCountryWhois.csv and re-run the comparison:

python blockfinder -r ManualGeoIPCountryWhois.csv
python blockfinder -p A1 | less

There should be no "Assignments in 'ManualGeoIPCountryWhois.csv'" section anymore, because all A1 entries should have been edited by now. But there is an "Assignments in 'GeoIPCountryWhois.csv'" section with quite a lot of blocks in it. There are two types of conflicts, and we're only interested in one of them: the uninteresting conflict is where GeoIPCountryWhois.csv has an assignment for A1 and both ManualGeoIPCountryWhois.csv and rir agree on another country code. For example:

  US 8.10.6.244-8.12.36.255 GeoIPCountryWhois.csv
< A1 8.12.37.0-8.12.37.255 GeoIPCountryWhois.csv
  US 8.12.38.0-8.14.223.255 GeoIPCountryWhois.csv
  US 8.10.6.244-8.12.36.255 ManualGeoIPCountryWhois.csv
* US 8.12.37.0-8.12.37.255 ManualGeoIPCountryWhois.csv
  US 8.12.38.0-8.14.223.255 ManualGeoIPCountryWhois.csv
* US 8.0.0.0-8.255.255.255 rir

This conflict implies that either deanonymind.py or our manual edits were likely correct, so it's uninteresting. But then there's another type of conflict where all three databases have a different assignment. These conflicting lines are prefixed with '#' instead of '*'. The first such conflict is:

  CA 38.80.64.0-38.80.71.255 GeoIPCountryWhois.csv
< A1 38.80.72.0-38.80.73.255 GeoIPCountryWhois.csv
  CA 38.80.74.0-38.80.75.255 GeoIPCountryWhois.csv
  CA 38.80.64.0-38.80.71.255 ManualGeoIPCountryWhois.csv
* CA 38.80.72.0-38.80.73.255 ManualGeoIPCountryWhois.csv
  CA 38.80.74.0-38.80.75.255 ManualGeoIPCountryWhois.csv
# US 38.0.0.0-38.255.255.255 rir

This conflict is interesting, but still can be ignored after reviewing it. It's quite obvious that our choice of CA is more likely correct even if it conflicts with the RIR delegation files which say US.

There are 11 '#' conflicts for the November database, after automatic and manual changes, and we'll have to look at each of them. If we're unhappy with a conflict, we'll have to edit ManualGeoIPCountryWhois.csv again, re-import it, and look again.
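The distinction between the conflict types described above can be written down as a small decision function. This is a sketch of the reasoning only, with hypothetical names and labels; it does not reproduce how blockfinder chooses its markers.

```python
def classify_conflict(maxmind_cc, manual_cc, rir_cc):
    """Classify one address block by the three sources' country codes."""
    if manual_cc == "A1":
        return "unresolved"      # still A1 after automatic and manual edits
    if manual_cc == rir_cc:
        return "uninteresting"   # our edit agrees with the RIR files
    if len({maxmind_cc, manual_cc, rir_cc}) == 3:
        return "needs review"    # all three sources disagree
    return "other"


print(classify_conflict("A1", "A1", "NL"))  # 31.171.134.0-31.171.135.255
print(classify_conflict("A1", "US", "US"))  # 8.12.37.0-8.12.37.255
print(classify_conflict("A1", "CA", "US"))  # 38.80.72.0-38.80.73.255
```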

Review manual changes a last time:

diff -U1 AutomaticGeoIPCountryWhois.csv ManualGeoIPCountryWhois.csv | less

Convert new file to Tor's geoip file format:

cut -d, -f3-5 < ManualGeoIPCountryWhois.csv | sed 's/"//g' > geoip

Prepend geoip file with a comment like the following:

# Last updated based on November 7 2012 Maxmind GeoLite Country
# See $SOME_README_FILE_OR_TRAC_LINK for details on the conversion.

Commit the new geoip file to tor's src/config/, done.
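For reference, the cut/sed conversion step above can also be expressed in Python. This is a sketch with an assumed function name; it keeps fields 3 to 5 of each CSV row (start and end as integers, country code) and drops the quotes.

```python
import csv
import io


def to_tor_geoip(csv_text):
    """Convert MaxMind CSV rows to 'start,end,country' lines for tor."""
    out = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 5:
            out.append(",".join(row[2:5]) + "\n")
    return "".join(out)


sample = '"31.171.134.0","31.171.135.255","531334656","531335167","NL","Netherlands"\n'
print(to_tor_geoip(sample), end="")
```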

If you like this approach, I have an A1-less November 2012 database here that we can ship with the next Tor version. I'd need to know how we'd want to document a) the general approach (basically what I described in this comment) and b) the manual changes.

comment:9 Changed 7 years ago by nickm

I think this is a plausible approach. It's annoyingly labor-intensive, though. Here are some suggestions:

  • As much as possible of the above (excluding downloading the files and blockfinder) should be done with scripts; there should not be a 12-step process that people need to do by hand whenever maxmind updates.
  • It would be best to have some way to record the outcome of the manual resolution decisions, and re-apply them, so that A) it's easy to document which manual changes we made, and B) we don't need to resolve the same conflict more than once. Could there be a file that contains the manual changes that we re-apply to resolve conflicts?

> I'd need to know how we'd want to document a) the general approach (basically what I described in this comment) and b) the manual changes.

The instructions for making the file should really be in a README.geoip file distributed with the code. For documenting the manual changes, see above.

comment:10 in reply to:  9 Changed 7 years ago by karsten

Status: changed from needs_review to needs_revision

Replying to nickm:

> I think this is a plausible approach. It's annoyingly labor-intensive, though. Here are some suggestions:
>
>   • As much as possible of the above (excluding downloading the files and blockfinder) should be done with scripts; there should not be a 12-step process that people need to do by hand whenever maxmind updates.

Makes sense. We could make deanonymind.py call blockfinder and add explanatory text similar to the text from my previous comment. What remains is for a human to check the output of deanonymind.py very carefully.

>   • It would be best to have some way to record the outcome of the manual resolution decisions, and re-apply them, so that A) it's easy to document which manual changes we made, and B) we don't need to resolve the same conflict more than once. Could there be a file that contains the manual changes that we re-apply to resolve conflicts?

Agreed. Let's make a file for manual changes (e.g., src/config/geoip-manual) and have deanonymind.py apply it. Here's what this file could look like:

# NL, because previous MaxMind entry 31.171.128.0-31.171.133.255 is NL,
# and RIR delegation files say 31.171.128.0-31.171.135.255 is NL.
# -KL 2012-11-27
"31.171.134.0","31.171.135.255","531334656","531335167","NL","Netherlands"

# XY, because [...]
"[...]"

The script would look for an existing A1 entry with the given start and end address and replace it with this line. It would also warn if it was unable to replace an entry.
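A minimal sketch of that replacement step, assuming the six-field CSV layout shown above. The function names are hypothetical and this is not the code in deanonymind.py:

```python
import csv

MANUAL_TEXT = '''
# NL, because previous MaxMind entry 31.171.128.0-31.171.133.255 is NL,
# and RIR delegation files say 31.171.128.0-31.171.135.255 is NL.
# -KL 2012-11-27
"31.171.134.0","31.171.135.255","531334656","531335167","NL","Netherlands"
'''


def parse_manual(text):
    """Return the six-field replacement rows, skipping comments and blanks."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = next(csv.reader([line]))
        if len(fields) == 6:
            rows.append(fields)
    return rows


def apply_manual(entries, manual_rows):
    """Replace A1 entries that have the same start and end address.

    Returns (new_entries, warnings); a warning is produced for every
    manual row that matched no existing A1 entry.
    """
    a1_index = {(row[2], row[3]): i
                for i, row in enumerate(entries) if row[4] == "A1"}
    result = list(entries)
    warnings = []
    for row in manual_rows:
        key = (row[2], row[3])
        if key in a1_index:
            result[a1_index[key]] = row
        else:
            warnings.append("no matching A1 entry for %s-%s" % (row[0], row[1]))
    return result, warnings


entries = [["31.171.134.0", "31.171.135.255", "531334656", "531335167",
            "A1", "Anonymous Proxy"]]
new_entries, warnings = apply_manual(entries, parse_manual(MANUAL_TEXT))
print(new_entries, warnings)
```

Keeping the justification as comments next to each replacement line is what makes the manual decisions reviewable and re-applicable on the next database update.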

> > I'd need to know how we'd want to document a) the general approach (basically what I described in this comment) and b) the manual changes.
>
> The instructions for making the file should really be in a README.geoip file distributed with the code. For documenting the manual changes, see above.

Here's what this file could look like:

The IP-to-country-code file in src/config/geoip is based on MaxMind's
GeoLite Country database, with the following modifications to entries
mapping IP address ranges to "A1" ("Anonymous Proxy"):

 - Those "A1" entries lying in between two entries with the same country
   code are automatically changed to that country code.  These changes can
   be overridden by specifying a different country code for one or more of
   these entries in src/config/geoip-manual.

 - Other "A1" entries are replaced with country codes specified in
   src/config/geoip-manual, or are left as is if there is no replacement
   entry in that file.

Run src/config/deanonymind.py for details.

So, the next step would be to extend deanonymind.py to check for GeoIPCountryCSV.zip and possibly geoip-manual as input and to check if blockfinder is available. Then it would do the steps described in my previous comment, including applying manual changes, and write its output to AutomaticGeoIPCountryWhois.csv, ManualGeoIPCountryWhois.csv, and geoip. That would reduce manual steps a lot.

I'll hack on this now and create a tor branch for it. This might take me a while. Thanks for your feedback! If you have more ideas, please let me know.

comment:11 Changed 7 years ago by karsten

Status: changed from needs_revision to needs_review

Please review branch task-6266 in my public repository. I implemented the geoip-manual idea which will save a lot of manual work. I decided not to depend on blockfinder, but only to suggest it as a powerful verification tool, mostly because external dependencies like this are bad, but also because people can already do a fair amount of verification by looking at geoip-manual and using diff. I hope the instructions in README.geoip are simple enough. The task-6266 branch also contains the cleaned up November 2012 database. Thanks!

comment:12 Changed 7 years ago by nickm

Looks workable! I haven't reviewed the python closely, but it seems not to do anything insanely risky, and it's not going to get run automatically on untrusted input or anything.

I think this can get merged to master. But should the geoip file get merged to previous maintained branches?

comment:13 in reply to:  12 Changed 7 years ago by karsten

Replying to nickm:

> Looks workable! I haven't reviewed the python closely, but it seems not to do anything insanely risky, and it's not going to get run automatically on untrusted input or anything.

Cool! (I'm glad for any input to improve my Python coding skills, so if you ever find yourself with nothing else to do, or at least with everything else being even less exciting, please let me know how I could improve this particular piece of Python. Thanks! :) )

> I think this can get merged to master. But should the geoip file get merged to previous maintained branches?

Yes, it should. I based previous geoip file updates on maint-0.2.2.

I assume we won't ship deanonymind.py and README.geoip with the maintained branches, will we? In that case we should probably replace the comment in geoip with

# Last updated based on November 7 2012 Maxmind GeoLite Country
# See README.geoip in Git master for details on the conversion.

or

# Last updated based on November 7 2012 Maxmind GeoLite Country
# Visit the following URL for details on the conversion:
# https://gitweb.torproject.org/tor.git/blob/HEAD:/src/config/README.geoip

Please let me know what changes I should make to the task-6266 branch. Of course, if those changes are trivial, please feel free to just change/squash/rebase that branch yourself. Thanks!

comment:14 Changed 7 years ago by ioerror

I think ideally blockfinder should do all of this work. Please consider adding code to blockfinder that does exactly what deanonymind does and I'll merge it.

comment:15 Changed 7 years ago by ioerror

To be specific about my thoughts: it seems to me that blockfinder should be the tool that, when run, downloads all the files, outputs hashes of all the data it used as inputs, and then outputs the fixups that you wish to see. That way, we can simply include the output of that single run; to reproduce it, one would run blockfinder in exactly the same way. It might also make sense to allow a user to give a specific set of hashes and have blockfinder download *those* files from the MaxMind website or from our website.
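Recording hashes of the inputs could look roughly like this. It is a sketch only: SHA-256 and the helper names are assumptions, and nothing here is an existing blockfinder feature.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def input_manifest(paths):
    """Return one 'digest  filename' line per input file."""
    return ["%s  %s" % (sha256_of_file(path), path) for path in paths]
```

Calling `input_manifest(["GeoIPCountryCSV.zip"])` after the download step would give a line that can be published next to the generated geoip file, so others can confirm they started from the same input.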

comment:16 in reply to:  14 Changed 7 years ago by karsten

Replying to ioerror:

> I think ideally blockfinder should do all of this work. Please consider adding code to blockfinder that does exactly what deanonymind does and I'll merge it.

I disagree for two reasons. The first reason is that the current deanonymind.py doesn't fit into blockfinder very well. deanonymind.py takes two files as input (original MaxMind file and geoip-manual) and produces three files as output (two modified MaxMind files and the geoip file for tor). blockfinder is designed around a local IP-to-country database cache with its usage modes being either to modify the cache or request information from it. What we'd have to do to integrate deanonymind.py is split it up into multiple modes to a) make a country code "disappear" by automatically merging its entries with adjacent entries, b) apply manual changes from a file, c) export to CSV in long and short format. These changes are not impossible to make. However, I'm currently lacking the developer time to make them.

The second and more important reason is that we should really avoid that tor relies on blockfinder for creating the modified geoip file. It's a great tool for verifying the output of those modifications, and I can highly recommend it for that, but it should stay optional. The main reason is that as many people as possible should be able to verify what modifications we make to MaxMind's database. The current 194 lines of Python in deanonymind.py and the 114 lines of documented manual changes in geoip-manual are probably at the upper limit of what we can expect enthusiastic community members to read and understand. And they can also use diff or their favorite tool to do this verification, because we give them all intermediate .csv files. Or they can use blockfinder if they wish. But giving them the 844 lines of blockfinder to review, which would probably grow far beyond 1000 lines when adding the A1-fixing functionality, means that hardly anybody will check what's going on.

Note that this discussion is unrelated to using a database produced by blockfinder as a general replacement for MaxMind's patched-up database. Ideally, we'd take the RIR delegation files as input, maybe add LIR information, run traceroutes and whatever else to confirm/contradict these assignments, and basically produce our own IP-to-country database. I'm willing to contribute more code to blockfinder to get closer to that, e.g., a CSV export function. Please open blockfinder issues for features you think are missing, and I might hack on them as time permits. But I think that's a separate discussion. In my understanding, this is something that can happen in 6 to 12 months from now, assuming we put enough energy into it, but not earlier.

Unrelated to your concerns, I'd be interested in your thoughts on the geoip-manual file. Can you review those manual changes? If you have additional facts, it would be good to add them as comments. And, of course, if you have contradicting facts, that would be even more important to know.

Leaving in needs_review, because I still think this code should be reviewed and merged into tor.

comment:17 Changed 7 years ago by karsten

Updated my task-6266 branch to the December database update.

What's the reviewing/merging status here?

comment:18 Changed 7 years ago by nickm

Resolution: fixed
Status: changed from needs_review to closed

Merged this. I still haven't done more than sanity-check the python code, but since this program only needs to work for a couple of people, it doesn't need to be beautiful.

I guess we should have a new ticket for backporting geoip files? Closing this one.

If some other program should do this stuff, it'd be fine to have a ticket for that, though this is enough of a kludge that I'm okay having it be its own little kludgey script.
