The more A1 relays we have, the more useless it is to think that setting EntryNodes or ExitNodes by country works. Especially for people who are trying to Exclude countries.
The more A1 relays we have, the more useless it is to think that setting
EntryNodes or ExitNodes by country works. Especially for people who are
trying to Exclude countries.
True. The June database contains 341 "A1" entries, compared to 96 such entries in April and May and even fewer before. We'll have to do something about that.
So the first question is, are the Tor relays that they're adding only
single IP blocks?
I looked up all relays in the consensus from 2012-07-01 00:00:00 UTC in the June database. Here are the matches:
The 12 ranges covering 4 addresses or fewer look like we could repair them by merging them with the previous and/or next range. The 6 ranges with more than 4 addresses might be more difficult, at least without using additional information.
What we could also do is look up the "A1" ranges in a second GeoIP database and use those entries. We'd have to do it in a way that is easily reproducible for others.
The "look for cases where the previous and next entry are in the same country" rule resolves 90% of the A1 entries in the June MaxMind db.
Before applying that rule, we put 0.57% of all routers in A1. After that rule, we put only 0.07% of the routers (that is to say, 2 of them) in the A1 region.
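As a rough illustration of the "previous and next entry are in the same country" rule, here is a minimal Python sketch. It is not the script discussed later in this ticket; the function name resolve_a1 and the (start, end, code) tuple layout are assumptions made for illustration, and the sketch does not check whether neighboring ranges are strictly adjacent.

```python
def resolve_a1(entries):
    """Return a copy of entries with simple A1 cases resolved.

    entries is a list of (start, end, country_code) tuples sorted by
    start address (hypothetical layout). An A1 entry takes over the
    country code of its neighbors only if both neighbors agree and
    neither of them is A1 itself.
    """
    resolved = []
    for i, (start, end, code) in enumerate(entries):
        if code == "A1" and 0 < i < len(entries) - 1:
            prev_code = entries[i - 1][2]
            next_code = entries[i + 1][2]
            if prev_code == next_code and prev_code != "A1":
                code = prev_code
        resolved.append((start, end, code))
    return resolved


# Example ranges taken from this ticket: the first A1 block sits between
# two US blocks, the second between NL and IT.
example = [
    ("8.10.6.244", "8.12.36.255", "US"),
    ("8.12.37.0", "8.12.37.255", "A1"),
    ("8.12.38.0", "8.14.223.255", "US"),
    ("31.171.128.0", "31.171.133.255", "NL"),
    ("31.171.134.0", "31.171.135.255", "A1"),
    ("31.171.136.0", "31.171.143.255", "IT"),
]
fixed = resolve_a1(example)
```

In this sketch the 8.12.37.0 block becomes US, while the 31.171.134.0 block stays A1 because its neighbors disagree; entries like that one are the ones left for a human to resolve.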
The "look for cases where the previous and next entry are in the same country" rule resolves 90% of the A1 entries in the June MaxMind db.
After looking into using Software77's database or the RIR delegation files as a replacement, I like the approach you suggest here best. I think we should resolve those 90% of A1 entries automatically and have a human fix the remaining 10% by using RIR delegation files as a reference and using common sense. If we document what changes we made and make it easy for others to verify our decisions, I think we should be all set.
I wrote a script to fix the simple cases in MaxMind's database, and I extended blockfinder to show differences between GeoIP databases. Here's what I did to fix the 90%+10% of A1 entries. I wrote this down as documentation that we can ship together with the geoip file for others to verify what we did.
Run blockfinder to compare the three data sources for the A1 country code.
python blockfinder -p A1 | less
Scroll down to "Assignments in 'ManualGeoIPCountryWhois.csv'". The blocks shown there are the A1 entries that could not be resolved by deanonymind.py, most likely because previous and subsequent country codes do not match. There are 19 such entries in the November 2012 database, so within scope for a human to fix. Here's an example:
  NL 31.171.128.0-31.171.133.255 GeoIPCountryWhois.csv
> A1 31.171.134.0-31.171.135.255 GeoIPCountryWhois.csv
  IT 31.171.136.0-31.171.143.255 GeoIPCountryWhois.csv
  NL 31.171.128.0-31.171.133.255 ManualGeoIPCountryWhois.csv
< A1 31.171.134.0-31.171.135.255 ManualGeoIPCountryWhois.csv
  IT 31.171.136.0-31.171.143.255 ManualGeoIPCountryWhois.csv
* NL 31.171.128.0-31.171.135.255 rir
  IT 31.171.136.0-31.171.143.255 rir
In this case the two MaxMind files still agree that 31.171.134.0-31.171.135.255 should be assigned to A1 whereas the RIR delegation files say NL. It seems clear that NL is correct here, so we can manually change this line in ManualGeoIPCountryWhois.csv to NL. Repeat 18 times for the remaining A1 entries.
Re-import ManualGeoIPCountryWhois.csv and re-run the comparison:
python blockfinder -r ManualGeoIPCountryWhois.csv
python blockfinder -p A1 | less
There should be no "Assignments in 'ManualGeoIPCountryWhois.csv'" section anymore, because all A1 entries should have been edited by now. But there is an "Assignments in 'GeoIPCountryWhois.csv'" section with quite a lot of blocks in it. There are two types of conflicts, and we're only interested in one of them: the uninteresting conflict is where GeoIPCountryWhois.csv has an assignment for A1 and both ManualGeoIPCountryWhois.csv and rir agree on another country code. For example:
  US 8.10.6.244-8.12.36.255 GeoIPCountryWhois.csv
< A1 8.12.37.0-8.12.37.255 GeoIPCountryWhois.csv
  US 8.12.38.0-8.14.223.255 GeoIPCountryWhois.csv
  US 8.10.6.244-8.12.36.255 ManualGeoIPCountryWhois.csv
* US 8.12.37.0-8.12.37.255 ManualGeoIPCountryWhois.csv
  US 8.12.38.0-8.14.223.255 ManualGeoIPCountryWhois.csv
* US 8.0.0.0-8.255.255.255 rir
This conflict implies that either deanonymind.py or our manual edits were likely correct, so it's uninteresting. But then there's another type of conflict where all three databases have a different assignment. These conflicting lines are prefixed with '#' instead of '*'. The first such conflict is:
  CA 38.80.64.0-38.80.71.255 GeoIPCountryWhois.csv
< A1 38.80.72.0-38.80.73.255 GeoIPCountryWhois.csv
  CA 38.80.74.0-38.80.75.255 GeoIPCountryWhois.csv
  CA 38.80.64.0-38.80.71.255 ManualGeoIPCountryWhois.csv
* CA 38.80.72.0-38.80.73.255 ManualGeoIPCountryWhois.csv
  CA 38.80.74.0-38.80.75.255 ManualGeoIPCountryWhois.csv
# US 38.0.0.0-38.255.255.255 rir
This conflict is interesting, but still can be ignored after reviewing it. It's quite obvious that our choice of CA is more likely correct even if it conflicts with the RIR delegation files which say US.
There are 11 '#' conflicts for the November database, after automatic and manual changes, and we'll have to look at each of them. If we're unhappy with a conflict, we'll have to edit ManualGeoIPCountryWhois.csv again, re-import it, and look again.
Review manual changes a last time:
diff -U1 AutomaticGeoIPCountryWhois.csv ManualGeoIPCountryWhois.csv | less
Convert new file to Tor's geoip file format:
cut -d, -f3-5 < ManualGeoIPCountryWhois.csv | sed 's/"//g' > geoip
Prepend geoip file with a comment like the following:
# Last updated based on November 7 2012 Maxmind GeoLite Country
# See $SOME_README_FILE_OR_TRAC_LINK for details on the conversion.
Commit the new geoip file to tor's src/config/, done.
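For readers who would rather not rely on cut and sed, the conversion step in the procedure above can be sketched in Python. This is only an illustration under the assumption that the input has the six-column MaxMind CSV layout shown elsewhere in this ticket (start address, end address, start integer, end integer, country code, country name); the function name to_geoip_lines is hypothetical.

```python
import csv
import io


def to_geoip_lines(csv_text):
    """Convert MaxMind-style CSV text into 'start,end,code' lines.

    Keeps columns 3 to 5 (integer range start, integer range end,
    country code), mirroring `cut -d, -f3-5` followed by removing
    the double quotes.
    """
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 5:
            # Skip blank or malformed rows instead of guessing.
            continue
        lines.append(",".join(row[2:5]))
    return lines


sample = '"31.171.134.0","31.171.135.255","531334656","531335167","NL","Netherlands"\n'
geoip_lines = to_geoip_lines(sample)
```

Unlike the shell pipeline, this sketch parses quoted fields with the csv module, so a comma inside a quoted country name would not shift the columns.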
If you like this approach, I have an A1-less November 2012 database here that we can ship with the next Tor version. I'd need to know how we'd want to document a) the general approach (basically what I described in this comment) and b) the manual changes.
I think this is a plausible approach. It's annoyingly labor-intensive, though. Here are some suggestions:
As much as possible of the above (excluding downloading the files and blockfinder) should be done with scripts; there should not be a 12-step process that people need to do by hand whenever maxmind updates.
It would be best to have some way to record the outcome of the manual resolution decisions, and re-apply them, so that A) it's easy to document which manual changes we made, and B) we don't need to resolve the same conflict more than once. Could there be a file that contains the manual changes that we re-apply to resolve conflicts?
I'd need to know how we'd want to document a) the general approach (basically what I described in this comment) and b) the manual changes.
The instructions for making the file should really be in a README.geoip file distributed with the code. For documenting the manual changes, see above.
I think this is a plausible approach. It's annoyingly labor-intensive, though. Here are some suggestions:
As much as possible of the above (excluding downloading the files and blockfinder) should be done with scripts; there should not be a 12-step process that people need to do by hand whenever maxmind updates.
Makes sense. We could make deanonymind.py call blockfinder and add explanatory text similar to the text from my previous comment. What remains is for a human to check the output of deanonymind.py very carefully.
It would be best to have some way to record the outcome of the manual resolution decisions, and re-apply them, so that A) it's easy to document which manual changes we made, and B) we don't need to resolve the same conflict more than once. Could there be a file that contains the manual changes that we re-apply to resolve conflicts?
Agreed. Let's make a file for manual changes (e.g., src/config/geoip-manual) and have deanonymind.py apply it. Here's what this file could look like:
# NL, because previous MaxMind entry 31.171.128.0-31.171.133.255 is NL,
# and RIR delegation files say 31.171.128.0-31.171.135.255 is NL.
# -KL 2012-11-27
"31.171.134.0","31.171.135.255","531334656","531335167","NL","Netherlands"

# XY, because [...]
"[...]"
The script would look for an existing A1 entry with the given start and end address and replace it with this line. It would also warn if it was unable to replace an entry.
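The replacement behavior described here could look roughly like the following Python sketch. It is not the code that ended up in the task-6266 branch; the function name apply_manual_changes, the row layout (lists of six strings), and the warning text are all assumptions for illustration.

```python
import csv
import io
import sys


def apply_manual_changes(rows, manual_text):
    """Replace A1 rows with matching entries from a geoip-manual style text.

    rows is a list of six-string lists in MaxMind CSV column order.
    manual_text holds '#' comment lines and replacement CSV lines.
    Returns (new_rows, unused), where unused lists manual entries that
    did not match any existing A1 row.
    """
    data_lines = [line for line in io.StringIO(manual_text)
                  if line.strip() and not line.startswith("#")]
    # Key each manual entry by its dotted start and end address.
    manual = {(row[0], row[1]): row for row in csv.reader(data_lines)}
    new_rows = []
    for row in rows:
        key = (row[0], row[1])
        if row[4] == "A1" and key in manual:
            new_rows.append(manual.pop(key))
        else:
            new_rows.append(row)
    unused = list(manual.values())
    for row in unused:
        print("warning: no A1 entry found for %s-%s" % (row[0], row[1]),
              file=sys.stderr)
    return new_rows, unused


rows = [
    ["31.171.128.0", "31.171.133.255", "531333120", "531334655", "NL", "Netherlands"],
    ["31.171.134.0", "31.171.135.255", "531334656", "531335167", "A1", "Anonymous Proxy"],
    ["31.171.136.0", "31.171.143.255", "531335168", "531337215", "IT", "Italy"],
]
manual_text = (
    "# NL, because previous MaxMind entry 31.171.128.0-31.171.133.255 is NL.\n"
    '"31.171.134.0","31.171.135.255","531334656","531335167","NL","Netherlands"\n'
)
patched, unused = apply_manual_changes(rows, manual_text)
```

Matching on the exact start and end address means a manual entry silently stops applying if MaxMind later splits or merges the range, which is why the warning for unused entries matters.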
I'd need to know how we'd want to document a) the general approach (basically what I described in this comment) and b) the manual changes.
The instructions for making the file should really be in a README.geoip file distributed with the code. For documenting the manual changes, see above.
Here's what this file could look like:
The IP-to-country-code file in src/config/geoip is based on MaxMind's
GeoLite Country database, with the following modifications to entries
mapping IP address ranges to "A1" ("Anonymous Proxy"):

 - Those "A1" entries lying in between two entries with the same country
   code are automatically changed to that country code. These changes can
   be overridden by specifying a different country code for one or more of
   these entries in src/config/geoip-manual.

 - Other "A1" entries are replaced with country codes specified in
   src/config/geoip-manual, or are left as is if there is no replacement
   entry in that file.

Run src/config/deanonymind.py for details.
So, the next step would be to extend deanonymind.py to check for GeoIPCountryCSV.zip and possibly geoip-manual as input and to check if blockfinder is available. Then it would do the steps described in my previous commit including applying manual changes, and write its output to AutomaticGeoIPCountryWhois.csv, ManualGeoIPCountryWhois.csv, and geoip. That would reduce manual steps a lot.
I'll hack on this now and create a tor branch for it. This might take me a while. Thanks for your feedback! If you have more ideas, please let me know.
Please review branch task-6266 in my public repository. I implemented the geoip-manual idea which will save a lot of manual work. I decided not to depend on blockfinder, but only to suggest it as a powerful verification tool, mostly because external dependencies like this are bad, but also because people can already do a fair amount of verification by looking at geoip-manual and using diff. I hope the instructions in README.geoip are simple enough. The task-6266 branch also contains the cleaned up November 2012 database. Thanks!
Looks workable! I haven't reviewed the python closely, but it seems not to do anything insanely risky, and it's not going to get run automatically on untrusted input or anything.
I think this can get merged to master. But should the geoip file get merged to previous maintained branches?
Looks workable! I haven't reviewed the python closely, but it seems not to do anything insanely risky, and it's not going to get run automatically on untrusted input or anything.
Cool! (I'm glad for any input to improve my Python coding skills, so if you ever find yourself with nothing else to do, or at least with everything else being even less exciting, please let me know how I could improve this particular piece of Python. Thanks! :) )
I think this can get merged to master. But should the geoip file get merged to previous maintained branches?
Yes, it should. I based previous geoip file updates on maint-0.2.2.
I assume we won't ship deanonymind.py and README.geoip with the maintained branches, will we? In that case we should probably replace the comment in geoip with
# Last updated based on November 7 2012 Maxmind GeoLite Country
# See README.geoip in Git master for details on the conversion.
or
# Last updated based on November 7 2012 Maxmind GeoLite Country
# Visit the following URL for details on the conversion:
# https://gitweb.torproject.org/tor.git/blob/HEAD:/src/config/README.geoip
Please let me know what changes I should make to the task-6266 branch. Of course, if those changes are trivial, please feel free to just change/squash/rebase that branch yourself. Thanks!
I think that ideally blockfinder should do all of this work. Please consider adding code to blockfinder that does exactly what deanonymind does and I'll merge it.
To be specific about my thoughts: it seems to me that blockfinder should be the tool that, when run, downloads all the files, outputs hashes of all the data it used as inputs, and then outputs the fixups that you wish to see. That way, we can simply include the output of that single run; to reproduce it, one would run blockfinder in exactly the same way. It might also make sense to allow a user to give a specific set of hashes and have blockfinder download those files from the MaxMind website or from our website.
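The "hashes of all the data it used as inputs" idea can be sketched independently of blockfinder. The following is a minimal Python illustration, not an existing blockfinder feature; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def input_manifest(paths):
    """Return one 'digest  path' line per input file, sorted by path."""
    return ["%s  %s" % (sha256_of_file(path), path) for path in sorted(paths)]
```

Shipping such a manifest alongside the generated geoip file would let someone confirm they are starting from the same inputs before comparing outputs.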
I think that ideally blockfinder should do all of this work. Please consider adding code to blockfinder that does exactly what deanonymind does and I'll merge it.
I disagree for two reasons. The first reason is that the current deanonymind.py doesn't fit into blockfinder very well. deanonymind.py takes two files as input (original MaxMind file and geoip-manual) and produces three files as output (two modified MaxMind files and the geoip file for tor). blockfinder is designed around a local IP-to-country database cache with its usage modes being either to modify the cache or request information from it. What we'd have to do to integrate deanonymind.py is split it up into multiple modes to a) make a country code "disappear" by automatically merging its entries with adjacent entries, b) apply manual changes from a file, c) export to CSV in long and short format. These changes are not impossible to make. However, I'm currently lacking the developer time to make them.
The second and more important reason is that we should really avoid that tor relies on blockfinder for creating the modified geoip file. It's a great tool for verifying the output of those modifications, and I can highly recommend it for that, but it should stay optional. The main reason is that as many people as possible should be able to verify what modifications we make to MaxMind's database. The current 194 lines of Python in deanonymind.py and the 114 lines of documented manual changes in geoip-manual are probably at the upper limit of what we can expect enthusiastic community members to read and understand. And they can also use diff or their favorite tool to do this verification, because we give them all intermediate .csv files. Or they can use blockfinder if they wish. But giving them the 844 lines of blockfinder to review, which would probably grow far beyond 1000 lines when adding the A1-fixing functionality, means that hardly anybody will check what's going on.
Note that this discussion is unrelated to using a database produced by blockfinder as a general replacement for MaxMind's patched-up database. Ideally, we'd take the RIR delegation files as input, maybe add LIR information, run traceroutes and whatever else to confirm/contradict these assignments, and basically produce our own IP-to-country database. I'm willing to contribute more code to blockfinder to get closer to that, e.g., a CSV export function. Please open blockfinder issues for features you think are missing, and I might hack on them as time permits. But I think that's a separate discussion. In my understanding, this is something that can happen 6 to 12 months from now, assuming we put enough energy into it, but not earlier.
Unrelated to your concerns, I'd be interested in your thoughts on the geoip-manual file. Can you review those manual changes? If you have additional facts, it would be good to add them as comments. And, of course, if you have contradicting facts, that would be even more important to know.
Leaving in needs_review, because I still think this code should be reviewed and merged into tor.
Merged this. I still haven't done more than sanity-check the python code, but since this program only needs to work for a couple of people, it doesn't need to be beautiful.
I guess we should have a new ticket for backporting geoip files? Closing this one.
If some other program should do this stuff, it'd be fine to have a ticket for that, though this is enough of a kludge that I'm okay having it be its own little kludgey script.
Trac: Status: needs_review to closed; Resolution: N/A to fixed