Opened 8 years ago

Closed 8 years ago

#6438 closed task (wontfix)

Evaluate software77's geoip database

Reported by: nickm
Owned by:
Priority: Medium
Milestone: Tor: 0.2.3.x-final
Component: Core Tor/Tor
Version:
Severity:
Keywords: geoip tor-client
Cc: arma
Actual Points:
Parent ID: #6266
Points:
Reviewer:
Sponsor:

Description

One stopgap, since blockfinder isn't yet giving us geoip files and Maxmind is deteriorating, is to consider using software77.net/geo-ip's geoip database.

The format is almost the same as Maxmind's. Two issues to evaluate before we can use it:

  • The database says it's licensed under the GPLv3. Would we satisfy the terms of the GPLv3 by merely attributing it correctly and distributing the database itself under GPLv3, or would any part of Tor's license need to change? Wendy claims the former, and suggests that copyrights on databases are already pretty weak in the US.
  • How good is the database? To what extent does it agree/disagree with maxmind's?

Child Tickets

Attachments (1)

geoiptest.py (2.0 KB) - added by nickm 8 years ago.


Change History (10)

comment:1 Changed 8 years ago by nickm

To process their file into the format Tor likes, the preprocessing magic is:

% zcat IpToCountry.csv.gz | cut -d, -f1,2,5 | sed 's/"//g' > geoip
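For anyone without zcat/cut/sed handy, here is a rough Python equivalent of the pipeline above. It is only a sketch: the function names `convert_rows` and `convert_file` are made up for illustration, and it assumes the CSV rows are quoted fields with the range start, range end, and two-letter country code in columns 1, 2, and 5. Unlike the shell pipeline, it also drops blank lines and `#` comment lines, which is an assumption about what we want in the output.

```python
import csv
import gzip


def convert_rows(lines):
    """Yield 'start,end,CC' strings from IpToCountry.csv-style lines.

    Keeps columns 1, 2 and 5 and drops the quoting, like the
    cut/sed pipeline. Skipping blank and '#' lines is an assumption
    of this sketch, not something the shell pipeline does.
    """
    data = (ln for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#"))
    for row in csv.reader(data):
        if len(row) >= 5:
            yield ",".join((row[0], row[1], row[4]))


def convert_file(src="IpToCountry.csv.gz", dst="geoip"):
    """Hypothetical wrapper: read the gzipped CSV, write the geoip file."""
    with gzip.open(src, "rt", encoding="latin-1", newline="") as fin, \
            open(dst, "w") as fout:
        for out in convert_rows(fin):
            fout.write(out + "\n")
```

Calling `convert_file()` with the default names would read `IpToCountry.csv.gz` from the current directory and write `geoip` next to it.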

comment:2 Changed 8 years ago by nickm

It is about 2/3 the size of Maxmind's file in terms of number of entries.

comment:3 Changed 8 years ago by nickm

Here are the results I got using the May geoip file currently in src/config.

Looking at relay IPs currently in the microdesc consensus, it seems that for 97.3% of them, software77 agrees with Maxmind, or has an opinion where Maxmind did not. The ones where it differs are:

109.169.61.87 US => GB
109.73.160.179 IL => GB
138.199.69.233 NL => EU
142.4.33.231 CA => US
149.154.158.225 US => AT
176.31.132.137 GB => FR
176.31.15.236 GB => FR
176.31.64.252 IT => FR
176.31.69.74 GB => FR
176.31.9.77 GB => FR
178.17.48.136 IQ => DE
178.18.17.111 US => NL
178.18.254.11 US => DE
178.32.52.72 GB => FR
178.32.65.82 CZ => FR
178.33.169.35 GB => FR
178.33.32.123 DE => FR
178.33.33.209 DE => FR
188.165.24.70 LT => FR
188.165.73.221 IE => FR
192.71.245.137 IT => SE
194.14.179.62 IT => SE
194.150.168.79 EU => DE
194.150.168.95 EU => DE
195.5.121.253 NL => DE
199.36.123.104 CA => US
199.36.123.117 CA => US
199.36.123.21 CA => US
199.36.123.44 CA => US
199.36.123.88 CA => US
213.16.14.67 MQ => FR
216.12.198.82 SG => US
216.12.198.83 SG => US
216.12.198.84 SG => US
216.12.214.106 SG => US
37.205.9.131 CZ => SK
46.166.131.158 LU => GB
46.166.135.90 US => GB
46.166.146.152 RU => GB
46.166.146.194 RU => GB
46.166.159.35 DE => GB
46.166.159.51 DE => GB
46.166.159.52 DE => GB
46.166.159.53 DE => GB
46.166.159.54 DE => GB
46.166.159.55 DE => GB
46.166.159.56 DE => GB
46.166.159.57 DE => GB
46.166.159.58 DE => GB
46.166.159.59 DE => GB
46.166.159.90 DE => GB
46.166.159.91 DE => GB
46.166.159.92 DE => GB
46.21.151.71 US => NL
69.147.252.41 IN => US
69.195.211.198 CO => US
69.195.211.203 CO => US
69.90.151.229 CA => US
70.33.208.83 CA => US
74.120.12.135 DE => US
74.120.12.140 DE => US
74.120.13.132 DE => US
74.120.15.150 DE => US
81.58.10.210 NL => BE
82.146.49.65 US => RU
83.125.20.240 DE => EU
83.133.106.73 DE => EU
84.19.175.182 IT => DE
84.233.197.147 IT => GB
87.98.250.244 GB => FR
89.105.41.162 EU => NO
94.23.120.170 GB => FR
94.23.147.149 NL => FR
94.23.147.164 NL => FR
94.23.148.23 NL => FR
94.23.150.191 NL => FR
94.23.153.225 GB => FR
94.23.164.42 DE => FR
94.23.168.39 CZ => FR
94.23.168.56 CZ => FR
94.23.174.3 CZ => FR
0.973062853342

If I use the June geoip file instead, I get 96% agreement or improvement.

This doesn't seem to be a fluke; when I choose a large number of IPs generated at random, I see about 98% agreement or improvement.

I'm attaching the script I used to do this test.
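The attachment itself isn't reproduced in this ticket text, so the following is not geoiptest.py. It is a minimal sketch of the kind of comparison described above, under some stated assumptions: both databases are in Tor's `start,end,CC` integer-range format with non-overlapping ranges, and an address counts as "agreement or improvement" when the old database has no opinion or when both give the same country. All function names here are hypothetical.

```python
import bisect
import ipaddress
import random


def load_geoip(lines):
    """Parse 'start,end,CC' lines into a sorted list of (start, end, cc)."""
    ranges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        start, end, cc = line.split(",")[:3]
        ranges.append((int(start), int(end), cc.upper()))
    ranges.sort()
    return ranges


def lookup(ranges, ip):
    """Return the country code for a dotted-quad IPv4 address, or None."""
    n = int(ipaddress.IPv4Address(ip))
    # Find the last range whose start is <= n (assumes no overlaps).
    i = bisect.bisect_right(ranges, (n, float("inf"), "")) - 1
    if i >= 0 and ranges[i][0] <= n <= ranges[i][1]:
        return ranges[i][2]
    return None


def agreement_or_improvement(ips, old_db, new_db):
    """Return (fraction agreeing or improved, list of differing lookups)."""
    ok, diffs = 0, []
    for ip in ips:
        old, new = lookup(old_db, ip), lookup(new_db, ip)
        if old is None or old == new:
            ok += 1
        else:
            diffs.append((ip, old, new))
    return ok / len(ips), diffs


def random_ips(count, rng):
    """Generate `count` uniformly random IPv4 addresses as strings."""
    return [str(ipaddress.IPv4Address(rng.getrandbits(32)))
            for _ in range(count)]
```

Feeding it the relay addresses from a consensus gives the first kind of number; feeding it `random_ips(10000, random.Random())` gives the second. How the real script treats addresses that neither database knows about may differ from the assumption made here.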

Changed 8 years ago by nickm

Attachment: geoiptest.py added

comment:4 Changed 8 years ago by karsten

Regarding the license question, I took a closer look at the Software77 database and found that it's equivalent to the content of ftp://ftp.arin.net/pub/stats/arin/delegated-arin-latest and friends. The only thing that Software77 does is add reserved address spaces as country "ZZ", but we'd want to exclude those anyway. So, we can simply concat the "ipv4" lines from ARIN et al. and call the license question solved. We can also generate "ipv6" and "asn" databases from the same files.
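To make the "concat the ipv4 lines" idea concrete, here is a rough sketch of turning registry lines into Tor's `start,end,CC` format. It assumes the registries' pipe-separated statistics format as I understand it (`registry|cc|type|start|value|date|status`, where `value` is the number of addresses for `ipv4` records); the function name `registry_to_geoip` is made up, and this is not the script that would go into contrib/.

```python
import ipaddress


def registry_to_geoip(lines):
    """Yield 'start,end,CC' strings for the ipv4 records in registry lines.

    Assumed record layout: registry|cc|type|start|value|date|status
    Header, summary, asn and ipv6 lines are skipped, as are records
    with no usable country code (empty, '*' or 'ZZ').
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("|")
        if len(fields) < 7 or fields[2] != "ipv4":
            continue
        cc, start, count = fields[1], fields[3], fields[4]
        if cc in ("", "*", "ZZ"):
            continue
        first = int(ipaddress.IPv4Address(start))
        last = first + int(count) - 1  # 'value' is an address count
        yield "%d,%d,%s" % (first, last, cc)
```

Running this over the five registry files in turn and sorting the combined output by range start should give something comparable to the Software77 file minus its "ZZ" entries, if the format assumption holds.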

Using the data provided by the five registries directly also has the advantage that we don't have to rely on a third party anymore. We ran into that problem with ip-to-country removing almost all US addresses in September 2009 and now with Maxmind classifying relay addresses as "A1"; there's no guarantee that Software77 won't do something similarly stupid in the future.

Regarding the question of how much Software77/the registries agree with Maxmind, I'd like to run another comparison when Maxmind publishes their September database. Then we can traceroute relay addresses that they disagree about. I'll keep an eye on Maxmind's website in the next few days to see when their September database is available.

comment:5 in reply to:  4 Changed 8 years ago by karsten

Replying to karsten:

Regarding the question of how much Software77/the registries agree with Maxmind, I'd like to run another comparison when Maxmind publishes their September database. Then we can traceroute relay addresses that they disagree about. I'll keep an eye on Maxmind's website in the next few days to see when their September database is available.

Of course they published the database a few hours after I posted this comment.

Here are the results of Nick's script (with some trivial modifications) when running it on Maxmind's September 5 database, a Software77-like database built on September 5 from the five registry databases, and the consensus from September 5, 22:00 UTC:

testing the consensus
109.163.238.48 de => ro
138.199.68.230 nl => eu
149.154.158.225 us => at
173.245.79.54 cn => us
176.31.132.137 gb => fr
176.31.15.236 gb => fr
176.31.48.135 gb => fr
178.18.254.11 us => de
178.32.246.74 gb => fr
178.32.65.82 cz => fr
178.33.32.123 de => fr
188.165.73.221 ie => fr
192.71.245.137 it => se
192.71.245.72 it => se
192.71.245.89 it => se
194.150.168.79 eu => de
195.5.121.253 nl => de
199.36.123.113 ca => us
205.185.117.40 us => ca
208.111.45.245 in => us
209.141.61.9 us => ca
209.141.61.98 us => ca
216.12.198.82 sg => us
216.12.198.83 sg => us
216.12.198.84 sg => us
216.12.214.106 sg => us
216.231.135.28 es => us
37.205.9.131 cz => sk
37.235.48.132 pl => at
37.235.49.157 is => at
37.235.49.37 is => at
37.59.237.163 gb => fr
46.105.174.75 nl => fr
46.166.143.131 lu => gb
46.166.159.35 de => gb
46.166.159.51 de => gb
46.166.159.52 de => gb
46.166.159.53 de => gb
46.166.159.54 de => gb
46.166.159.55 de => gb
46.166.159.56 de => gb
46.166.159.57 de => gb
46.166.159.58 de => gb
46.166.159.59 de => gb
46.166.159.90 de => gb
46.166.159.91 de => gb
46.166.159.92 de => gb
46.21.151.71 us => nl
50.7.194.122 cz => us
50.7.240.10 cz => us
50.7.241.218 cz => us
50.7.246.50 cz => us
50.7.246.51 cz => us
50.7.246.52 cz => us
50.7.246.53 cz => us
50.7.246.54 cz => us
50.7.248.234 cz => us
50.7.248.235 cz => us
50.7.248.236 cz => us
50.7.248.237 cz => us
50.7.248.238 cz => us
50.7.253.194 cz => us
50.7.253.195 cz => us
50.7.253.196 cz => us
50.7.253.197 cz => us
50.7.253.198 cz => us
50.7.253.234 cz => us
50.7.253.235 cz => us
50.7.253.236 cz => us
50.7.253.237 cz => us
50.7.253.238 cz => us
54.247.9.57 ie => us
69.147.252.41 in => us
69.195.211.198 co => us
69.195.211.203 co => us
69.90.151.229 ca => us
74.116.249.71 se => us
74.120.12.135 de => us
74.120.12.140 de => us
74.120.15.150 de => us
77.244.254.227 de => at
77.244.254.228 de => at
77.244.254.229 de => at
77.244.254.230 de => at
82.146.49.65 us => ru
83.125.20.240 de => eu
83.133.106.73 de => eu
83.133.224.61 de => eu
84.19.175.182 it => de
84.200.76.196 us => de
84.233.197.147 it => gb
87.98.250.244 gb => fr
91.121.245.171 it => fr
94.23.117.228 de => fr
94.23.117.229 de => fr
94.23.120.170 gb => fr
94.23.147.149 nl => fr
94.23.147.164 nl => fr
94.23.148.23 nl => fr
94.23.150.191 nl => fr
94.23.153.225 gb => fr
94.23.164.42 de => fr
94.23.168.39 cz => fr
94.23.68.252 it => fr
94.23.70.173 it => fr
94.23.73.182 it => fr
0.964725457571
Testing 10000 random IPs
0.9833

I looked up a few of the addresses from that list (traceroute, whois, relay nicknames, contacts). It seems that Maxmind is correct in most of the cases and that the registry files are wrong.

Interestingly, whois requests agree with Maxmind in most (if not all) cases. It seems that the Maxmind database uses bulk whois data rather than the publicly available files.

How much do we care about the roughly 3.5% of wrongly identified relay addresses and 1.7% of wrongly identified random/client addresses? We could contact the registries and ask for access to their bulk data. This might require some more parsing code on our end though. This is the as-good-as-Maxmind variant.

If we don't care as much, I'll rewrite the script that puts together the five registry files for Tor's contrib/ directory, and we can call it done. We could easily make a new geoip file whenever we put out a new Tor release. This is the as-good-as-Software77-variant.

comment:6 Changed 8 years ago by karsten

Cc: arma added

*bump*

comment:7 Changed 8 years ago by nickm

Keywords: tor-client added

comment:8 Changed 8 years ago by nickm

Component: Tor Client → Tor

comment:9 Changed 8 years ago by nickm

Resolution: wontfix
Status: new → closed

mooted by solution to #6266
