Opened 3 years ago

Last modified 15 months ago

#2506 assigned enhancement

Design and implement a more compact GeoIP file format

Reported by: rransom Owned by: endian7000
Priority: normal Milestone: Tor: unspecified
Component: Tor Version:
Keywords: tor-relay Cc: nils@…, dcf@…
Actual Points: Parent ID:
Points:

Description

Our current text-based GeoIP file (as of commit e9803aa71003079cc00a8b3c80324581758a36be; from the January 2011 MaxMind GeoLite Country dataset) is 3460049 bytes long (or 955382 bytes gzipped). In MaxMind's binary format, the February 2011 dataset is 1126966 bytes long, and gzips to about half that size. But we can do much better than that, and without having to use (or reverse-engineer and clone) their LGPL library.

The January 2011 GeoLite database contains 138658 data lines, each of which specifies a sequence of consecutive IPs assigned to a single country. The file contains runs of 4070 distinct lengths, and maps runs to 241 distinct countries. Even doubling the number of runs in order to account for the fact that some IPs are not contained in any run (which we should consider as a run assigned to 'no country'), and padding each run to a 3-byte field, we can store the mapping itself in at most 813 kiB, with a run-length table and country table totalling under 17 kiB. We can fit an additional random-access index consisting of one 4-byte starting IP for each 768-byte (256-run) block in just over 4 kiB if we want to keep the database itself in its packed form, whether in memory or on disk.

813 kiB is probably a wild overestimate for the size of the mapping; I haven't checked how many 'fake runs' we would need to add, but I would expect there are far fewer unassigned runs than runs assigned to a country in the database. I'm also not relying on any fancy encoding that would fit each run in less than 3 bytes.

Child Tickets

Attachments (1)

geoip_compress.py (3.8 KB) - added by nickm 3 years ago.

Download all attachments as: .zip

Change History (24)

comment:1 Changed 3 years ago by rransom

The calculations above are specific to IPv4; for IPv6, we would need to do some research into what current IPv6 IP-to-country datasets look like before designing a new file format for them.

comment:2 Changed 3 years ago by nickm

  • Milestone set to Tor: unspecified

comment:3 follow-up: Changed 3 years ago by mikey

Just another data point, moot if the original intent is to always read from disk, never load the whole GeoIP database into RAM.

But, if the intent is to save disk space once installed, LZMA produces decent results (tested with 7zip and April 2011 files):
1.1M GeoIP.dat compresses into 436KiB dat.7z
3.5M geoip compresses into 551KiB txt.7z

comment:4 in reply to: ↑ 3 Changed 3 years ago by rransom

Replying to mikey:

Just another data point, moot if the original intent is to always read from disk, never load the whole GeoIP database into RAM.

But, if the intent is to save disk space once installed, LZMA produces decent results (tested with 7zip and April 2011 files):
1.1M GeoIP.dat compresses into 436KiB dat.7z

In order to use Maxmind's binary format, we would need to either use their (LGPLed) library or reverse-engineer the format. If we implement our own binary format, I strongly suspect that we can fit the data we need into less than 436 kiB without additional compression. It would also be better to avoid using LZMA, as that would require linking Tor to yet another library.

comment:5 follow-up: Changed 3 years ago by endian7000

  • Owner set to endian7000
  • Status changed from new to assigned

I wrote a simple encoder with these results:

317,108 bytes: geoip-encoded

205,123 bytes: geoip-encoded.gz

https://github.com/andrewschaaf/geoip-packing

Does this look good? I can write a C decoder.

Also, I'm new to Tor development (thanks, Chuck Schumer!) -- how does one submit pull requests for the Tor repo? I'm only familiar with contributing to projects via GitHub.

comment:6 in reply to: ↑ 5 Changed 3 years ago by rransom

  • Milestone changed from Tor: unspecified to Tor: 0.2.3.x-final

Replying to endian7000:

I wrote a simple encoder with these results:

317,108 bytes: geoip-encoded

Great! That's better than I expected!

Keep in mind that if you want to use the dataset efficiently without unpacking all of it into a larger in-memory form, you will need to make some changes:

  • Put the country-code and run-length maps in separate files, rather than sticking all of the data in a single file. These files will be easiest to use if their elements have constant length (so all we need to do is mmap them or read them into buffers, then access them as arrays).
  • Generate an index file. (The index format I suggested requires that each run be encoded with the same length; you'll need to specify the position of each block of runs in addition to the IP address of the first run in the block with your variable-length encoding.)

But this would be quite an improvement even if you just modified our current geoip.c to read a dataset from your format.

205,123 bytes: geoip-encoded.gz

That's even better!

https://github.com/andrewschaaf/geoip-packing

Does this look good? I can write a C decoder.

Great! Start by reading src/or/geoip.c and its header file on master, since that's where this change is likely to be merged to.

Also, I'm new to Tor development (thanks, Chuck Schumer!) -- how does one submit pull requests for the Tor repo? I'm only familiar with contributing to projects via GitHub.

Post the Git URL and branch name on the relevant Trac ticket, and set the ticket's status to needs_review. I include a link to Gitweb for my branches as well (see #3000 for an example), but that's optional.

Also see the [WikiFormatting] page for how to escape an URL so Trac will let us select it (to paste it into a terminal).

Also, please avoid committing large binaries into Git branches; we will want to translate the GeoIP database into binary form either during the usual make process or when building a tarball. (Yes, 200 kB is large for a binary file in Git.)

comment:7 follow-up: Changed 3 years ago by nickm

  • Status changed from assigned to needs_review

Throwing this into needs_review myself.

I'm not so sure that having this stuff in separate run-length and cc files will actually be needed; endianness issues will keep us from reading any portable file into an array-of-country verbatim, I think. Let's see how tricky the read code is before we decide that complexifying the format is worth it in order to make the read code simpler.

I'm also not clear how best to read this format quickly on the fly: unpacking it all into ram seems like a lose if we don't have to; a workable index format would be neat (and would be much easier for fixed-length or self-synchronizing records.

comment:8 follow-up: Changed 3 years ago by nickm

Other thoughts: elsewhere we have used the country code "??" for an unassigned mapping.

We don't have a consistent internal uint8->country mapping in Tor: that needs to get saved too.

Right now the db has 246 country codes; it's be nice to have a good answer here for what we would do if we had to go over 255.

One possibility for making this format self-synchronizing is to reserve the run-length index 0 and the country code 0 such that neither corresponds to a run length or to a country. There are probably better ways too.

Another useful observation: there seem to be less than 32 run lengths that ever occur more than once. Not sure what we can use that for, though.

comment:9 in reply to: ↑ 7 ; follow-up: Changed 3 years ago by rransom

Replying to nickm:

I'm not so sure that having this stuff in separate run-length and cc files will actually be needed; endianness issues will keep us from reading any portable file into an array-of-country verbatim, I think.

The country codes are two-character ASCII strings, and are thus endianness-independent. The run lengths are integers, but could be encoded in big-endian form everywhere.

Let's see how tricky the read code is before we decide that complexifying the format is worth it in order to make the read code simpler.

I think putting each separate array in a separate file would give us a simpler format than putting all of the arrays in a single file.

I'm also not clear how best to read this format quickly on the fly: unpacking it all into ram seems like a lose if we don't have to; a workable index format would be neat (and would be much easier for fixed-length or self-synchronizing records.

We would need an index format even with fixed-length records -- each record corresponds to a wildly varying amount of IP address space. We look up the country associated with an IP address, so we need to read the database in order from some starting point for which we know both the starting IP address and the offset into the packed dataset. I suggest an index format consisting of (IP address, offset) pairs as fixed-length records; we can look up an IP address by performing a binary search through the index, then searching linearly through the runs in the piece of the packed dataset starting at the specified offset. Specifying the offset has the additional advantage that (if we know how to find the end of the index array) we can later put the packed dataset in the same file as (and following) the index.

comment:10 in reply to: ↑ 8 Changed 3 years ago by rransom

Replying to nickm:

Other thoughts: elsewhere we have used the country code "??" for an unassigned mapping.

We don't have a consistent internal uint8->country mapping in Tor: that needs to get saved too.

Right now the db has 246 country codes; it's be nice to have a good answer here for what we would do if we had to go over 255.

We could use a variable-length integer (possibly encoded as one of a small number of escape values followed by a protobuf variable-length integer).

One possibility for making this format self-synchronizing is to reserve the run-length index 0 and the country code 0 such that neither corresponds to a run length or to a country. There are probably better ways too.

As I explained in my previous comment, we need an index format anyway, so making the run-list format self-synchronizing won't help us.

Another useful observation: there seem to be less than 32 run lengths that ever occur more than once. Not sure what we can use that for, though.

I suspect that we could get a space win by constraining run lengths to a small set such that every run length in the dataset can be expressed as a sum of encodable run lengths (possibly the powers of two, although we could automatically search for a better set) and Huffman-coding or arithmetic-coding the run records; since we have to scan each chunk of runs linearly anyway, this wouldn't make our speed much worse. I think the added code complexity (in both the encoder and decoder) makes fancy coding schemes a bad idea, though.

comment:11 in reply to: ↑ 9 ; follow-up: Changed 3 years ago by nickm

Replying to rransom:

Replying to nickm:

I'm not so sure that having this stuff in separate run-length and cc files will actually be needed; endianness issues will keep us from reading any portable file into an array-of-country verbatim, I think.

The country codes are two-character ASCII strings, and are thus endianness-independent. The run lengths are integers, but could be encoded in big-endian form everywhere.

I thought that the whole point of endian7000's idea was that a lot of the savings came from variable-length run-length encoding. In the database I'm looking at, there are 4212 distinct run-length encodings. Lots of the win comes from encoding the more frequent run-lengths as a single byte and the less frequent ones as two bytes.

To quantify: 136810 of the runs in my geoip file would have their lengths represented as one byte in the var-length encoding, whereas 11586 would take two bytes. Using a fixed-width two-byte encoding for run lengths would add another 133K to the file size.

comment:12 in reply to: ↑ 11 Changed 3 years ago by rransom

Replying to nickm:

Replying to rransom:

Replying to nickm:

I'm not so sure that having this stuff in separate run-length and cc files will actually be needed; endianness issues will keep us from reading any portable file into an array-of-country verbatim, I think.

The country codes are two-character ASCII strings, and are thus endianness-independent. The run lengths are integers, but could be encoded in big-endian form everywhere.

I thought that the whole point of endian7000's idea was that a lot of the savings came from variable-length run-length encoding. In the database I'm looking at, there are 4212 distinct run-length encodings. Lots of the win comes from encoding the more frequent run-lengths as a single byte and the less frequent ones as two bytes.

To quantify: 136810 of the runs in my geoip file would have their lengths represented as one byte in the var-length encoding, whereas 11586 would take two bytes. Using a fixed-width two-byte encoding for run lengths would add another 133K to the file size.

The mapping of run-length codes to run lengths should be stored in a separate file, in which the run lengths are fixed-width big-endian integers, and each run-length code should be an index into that array. The mapping of country identifiers to two-character ISO country codes should be stored similarly. The list of runs should be stored in endian7000's variable-length-record format.

comment:13 Changed 3 years ago by endian7000

   735 country-codes__null-terminated
 16840 run-lengths__uint32be
306539 runs__uvarint

I'm up for implementing the index approach, but I probably won't be able to before this weekend.

comment:14 Changed 3 years ago by nickm

I did some quick tests on what to do if we ever need to handle over 256 country codes. Using a varint in the main data file to encode both run lengths and countries is less than 1% more expensive than using a single byte for countries, so I suggest that we future-proof here by using a varint for both.

I also tried a variant where we only provide special short codes for run-lengths that appear N or more times, and encode the others literally as (total_num_codes+actual_value). No value of N that I tried provided more than 4% savings, so it's probably not worth it.

comment:15 Changed 3 years ago by nickm

I'm attaching a quick and dirty implementation of the format, with indices. (I rote it on a car trip, so I didn't have endian7000's implementation in front of me, and I wanted something to play with.) It sticks the tables and index in one file, and the data in another, though I'm still not sold on having two separate files -- if we do keep files separate, we should do something to keep people from accidentally getting thmem out of sync.

Also, this approach still needs C code. I'm going to suggest that we start by just adding this as another format that we read into our existing in-core format, and then only later add support for keeping the raw data mmap'd for low-RAM environments.

endian7000, are you still interested in working on that?

Changed 3 years ago by nickm

comment:16 Changed 3 years ago by nickm

  • Status changed from needs_review to assigned

Yoinking this out of needs_review for now; we should throw it back in when there is more code to review.

comment:17 Changed 2 years ago by nickm

  • Milestone changed from Tor: 0.2.3.x-final to Tor: 0.2.4.x-final

comment:18 Changed 2 years ago by nils

  • Cc nils@… added

comment:19 follow-up: Changed 22 months ago by dcf

  • Cc dcf@… added

I tried to implement geoip lookup using a binary decision diagram (BDD) rather than a binary search over ranges. With a BDD, you look at a bit of the address, and depending on whether it is a 0 or 1 you jump to another node in a dag to look at another bit, terminating when you get to a country code. In summary: it works, but doesn't save much space. (One should publish negative results, right?) Code is available from git clone http://www.bamsoftware.com/git/geoip-bdd.git.

The geoip database of June 6 2012 has 168366 ranges and is 4.0 MB on disk. A reduced BDD computing the same mapping (geoip-rdd.bdd) has 185170 nodes and is 3.6 MB on disk. A range could reasonably take 10 bytes (int32 ip_low, int32 ip_high, 2-byte country code) for a total of about 1.6 MB in memory; and a BDD node could reasonably fit in 8 bytes (int32 lo pointer and int32 hi pointer, stealing the high 5 bits of a node index for a bit number) for a total of about 1.4 MB in memory.

The size of a BDD depends on its variable ordering. I only tried looking at bits from most-to-least significant. My intuition says that should be close to the best ordering, because of how addresses are allocated. I didn't test any others, partly because ranges aren't contiguous using any other ordering, complicating the construction of the BDD.

comment:20 in reply to: ↑ 19 Changed 22 months ago by rransom

Replying to dcf:

I tried to implement geoip lookup using a binary decision diagram (BDD) rather than a binary search over ranges. With a BDD, you look at a bit of the address, and depending on whether it is a 0 or 1 you jump to another node in a dag to look at another bit, terminating when you get to a country code. In summary: it works, but doesn't save much space. (One should publish negative results, right?)

Yes.

The geoip database of June 6 2012 has 168366 ranges and is 4.0 MB on disk. A reduced BDD computing the same mapping (geoip-rdd.bdd) has 185170 nodes and is 3.6 MB on disk. A range could reasonably take 10 bytes (int32 ip_low, int32 ip_high, 2-byte country code) for a total of about 1.6 MB in memory; and a BDD node could reasonably fit in 8 bytes (int32 lo pointer and int32 hi pointer, stealing the high 5 bits of a node index for a bit number) for a total of about 1.4 MB in memory.

What if you use variable-size integers for the if-0 and if-1 offsets, and store them as offsets from (the end of) the current node?

The size of a BDD depends on its variable ordering. I only tried looking at bits from most-to-least significant. My intuition says that should be close to the best ordering, because of how addresses are allocated. I didn't test any others, partly because ranges aren't contiguous using any other ordering, complicating the construction of the BDD.

If you use a non-fixed ordering, you will probably waste space by having to store a bit index in every node.

(A BDD with fixed bit order is a ‘crit-bit tree’ or ‘PATRICIA tree’ (there are other names). See also http://cr.yp.to/critbit.html.)

comment:21 Changed 19 months ago by nickm

  • Keywords tor-relay added

comment:22 Changed 19 months ago by nickm

  • Component changed from Tor Relay to Tor

comment:23 Changed 15 months ago by nickm

  • Milestone changed from Tor: 0.2.4.x-final to Tor: unspecified
Note: See TracTickets for help on using tickets.