Design and implement a more compact GeoIP file format
Reported by: rransom
Owned by: endian7000
Our current text-based GeoIP file (as of commit e9803aa71003079cc00a8b3c80324581758a36be; from the January 2011 MaxMind GeoLite Country dataset) is 3460049 bytes long (or 955382 bytes gzipped). In MaxMind's binary format, the February 2011 dataset is 1126966 bytes long, and gzips to about half that size. But we can do much better than that, and without having to use (or reverse-engineer and clone) their LGPL library.
The January 2011 GeoLite database contains 138658 data lines, each of which specifies a run of consecutive IPs assigned to a single country. The file contains runs of 4070 distinct lengths, and maps runs to 241 distinct countries. Even doubling the number of runs to account for the fact that some IPs are not contained in any run (which we should treat as a run assigned to 'no country'), and padding each run to a 3-byte field, we can store the mapping itself in at most 813 KiB (2 × 138658 runs × 3 bytes = 831948 bytes), with a run-length table and country table totalling under 17 KiB. We can fit an additional random-access index consisting of one 4-byte starting IP for each 768-byte (256-run) block in just over 4 KiB if we want to keep the database itself in its packed form, whether in memory or on disk.
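For anyone who wants to check the arithmetic, here is a rough sketch of the estimate in Python. The run, length and country counts are the ones quoted above. The table entry widths (4 bytes per run length, 2 bytes per country code) are my assumptions, not something this ticket specifies.

```python
# Size estimate for the packed format, using the figures quoted in the ticket.
DATA_LINES = 138658         # runs in the January 2011 GeoLite Country dataset
DISTINCT_LENGTHS = 4070     # distinct run lengths
DISTINCT_COUNTRIES = 241    # distinct countries
RUN_BYTES = 3               # each run padded to a 3-byte field
RUNS_PER_BLOCK = 256        # one index entry per 256-run (768-byte) block
INDEX_ENTRY_BYTES = 4       # one 4-byte starting IP per block

# Assumed widths for the two lookup tables (not specified in the ticket).
LENGTH_ENTRY_BYTES = 4      # assumption: one 32-bit value per run length
COUNTRY_ENTRY_BYTES = 2     # assumption: one two-letter code per country


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


# Worst case: every real run is followed by one 'no country' run.
max_runs = 2 * DATA_LINES
mapping_bytes = max_runs * RUN_BYTES
tables_bytes = (DISTINCT_LENGTHS * LENGTH_ENTRY_BYTES
                + DISTINCT_COUNTRIES * COUNTRY_ENTRY_BYTES)
index_bytes = ceil_div(max_runs, RUNS_PER_BLOCK) * INDEX_ENTRY_BYTES

# Bits needed to index each table; together they must fit in one run field.
length_bits = (DISTINCT_LENGTHS - 1).bit_length()
country_bits = (DISTINCT_COUNTRIES - 1).bit_length()
assert length_bits + country_bits <= RUN_BYTES * 8

for name, size in [("mapping", mapping_bytes),
                   ("tables", tables_bytes),
                   ("index", index_bytes)]:
    print(f"{name}: {size} bytes ({size / 1024:.1f} KiB)")
```

Under these assumptions the mapping is 831948 bytes, the two tables together are 16762 bytes, and the index is 4336 bytes. A run-length index needs 12 bits and a country index needs 8 bits, so the pair fits in a 3-byte field with 4 bits to spare.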
813 KiB is probably a wild overestimate for the size of the mapping; I haven't checked how many 'fake runs' we would need to add, but I would expect far fewer unassigned runs than runs assigned to a country in the database. I'm also not relying on any fancy encoding that would fit each run in less than 3 bytes.
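To make the plain 3-byte layout concrete, here is a minimal sketch of one way to pack the runs and look up an address through the block index. It is only an illustration, not a proposed implementation. The bit layout (12-bit run-length index above an 8-bit country index), the function names, and the "??" placeholder for 'no country' are all hypothetical choices of mine. It also assumes the runs are given in address order, starting at address 0, with the gaps already filled in.

```python
import bisect

RUN_BYTES = 3          # each run is stored in a 3-byte field
RUNS_PER_BLOCK = 256   # one index entry per 256 runs (768 bytes)


def pack_run(length_idx, country_idx):
    """Pack a 12-bit run-length index and an 8-bit country index into 3 bytes."""
    assert 0 <= length_idx < (1 << 12) and 0 <= country_idx < (1 << 8)
    return ((length_idx << 8) | country_idx).to_bytes(RUN_BYTES, "big")


def build(runs):
    """Build the packed database from (length, country) pairs.

    The runs must be in address order and cover the space from 0 upward;
    unassigned gaps are passed in as runs for the placeholder country "??".
    Returns (packed, lengths, countries, block_starts).
    """
    lengths = sorted({n for n, _ in runs})
    countries = sorted({c for _, c in runs})
    length_idx = {n: i for i, n in enumerate(lengths)}
    country_idx = {c: i for i, c in enumerate(countries)}
    packed = bytearray()
    block_starts = []
    ip = 0
    for i, (n, c) in enumerate(runs):
        if i % RUNS_PER_BLOCK == 0:
            block_starts.append(ip)   # starting IP of this block
        packed += pack_run(length_idx[n], country_idx[c])
        ip += n
    return bytes(packed), lengths, countries, block_starts


def lookup(ip, packed, lengths, countries, block_starts):
    """Return the country for ip, or None if ip lies past the last run."""
    # Find the last block whose starting IP is <= ip.
    block = bisect.bisect_right(block_starts, ip) - 1
    start = block_starts[block]
    off = block * RUNS_PER_BLOCK * RUN_BYTES
    end = min(off + RUNS_PER_BLOCK * RUN_BYTES, len(packed))
    # Walk the runs inside the block, accumulating their lengths.
    while off < end:
        v = int.from_bytes(packed[off:off + RUN_BYTES], "big")
        n = lengths[v >> 8]
        if ip < start + n:
            return countries[v & 0xFF]
        start += n
        off += RUN_BYTES
    return None


# Tiny usage example with made-up runs (not real GeoIP data).
db = build([(10, "US"), (5, "??"), (10, "DE")])
print(lookup(12, *db))
```

A lookup costs one binary search over the index plus a scan of at most 256 runs, and the database never has to be unpacked.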
Change History (24)
comment:5 follow-up: ↓ 6 Changed 3 years ago by endian7000
- Owner set to endian7000
- Status changed from new to assigned
comment:6 in reply to: ↑ 5 Changed 3 years ago by rransom
- Milestone changed from Tor: unspecified to Tor: 0.2.3.x-final
Changed 3 years ago by nickm
comment:17 Changed 2 years ago by nickm
- Milestone changed from Tor: 0.2.3.x-final to Tor: 0.2.4.x-final