Design and implement a more compact GeoIP file format

changed milestone to %Tor: unspecified

added component::core tor/tor disk download-size memory metrics-geoip milestone::Tor: unspecified owner::endian7000 priority::medium severity::normal sponsor8-maybe status::assigned tor-relay type::enhancement labels

The calculations above are specific to IPv4; for IPv6, we would need to do some research into what current IPv6 IP-to-country datasets look like before designing a new file format for them.

Trac:
Actualpoints: N/A to N/A

Trac:
Milestone: N/A to Tor: unspecified

Just another data point, moot if the original intent is to always read from disk, never load the whole GeoIP database into RAM.

But, if the intent is to save disk space once installed, LZMA produces decent results (tested with 7zip and April 2011 files): 1.1M GeoIP.dat compresses into 436KiB dat.7z 3.5M geoip compresses into 551KiB txt.7z

Trac:
Username: mikey

Replying to mikey:

Just another data point, moot if the original intent is to always read from disk, never load the whole GeoIP database into RAM.

But, if the intent is to save disk space once installed, LZMA produces decent results (tested with 7zip and April 2011 files): 1.1M GeoIP.dat compresses into 436KiB dat.7z

In order to use Maxmind's binary format, we would need to either use their (LGPLed) library or reverse-engineer the format. If we implement our own binary format, I strongly suspect that we can fit the data we need into less than 436 kiB without additional compression. It would also be better to avoid using LZMA, as that would require linking Tor to yet another library.

I wrote a simple encoder with these results:

317,108 bytes: geoip-encoded

205,123 bytes: geoip-encoded.gz

https://github.com/andrewschaaf/geoip-packing

Does this look good? I can write a C decoder.

Also, I'm new to Tor development (thanks, Chuck Schumer!) -- how does one submit pull requests for the Tor repo? I'm only familiar with contributing to projects via GitHub.

Trac:
Username: endian7000
Owner: N/A to endian7000
Status: new to assigned

Replying to endian7000:

I wrote a simple encoder with these results:

317,108 bytes: geoip-encoded

Great! That's better than I expected!

Keep in mind that if you want to use the dataset efficiently without unpacking all of it into a larger in-memory form, you will need to make some changes:

Put the country-code and run-length maps in separate files, rather than sticking all of the data in a single file. These files will be easiest to use if their elements have constant length (so all we need to do is mmap them or read them into buffers, then access them as arrays).
Generate an index file. (The index format I suggested requires that each run be encoded with the same length; you'll need to specify the position of each block of runs in addition to the IP address of the first run in the block with your variable-length encoding.)

But this would be quite an improvement even if you just modified our current geoip.c to read a dataset from your format.

205,123 bytes: geoip-encoded.gz

That's even better!

https://github.com/andrewschaaf/geoip-packing

Does this look good? I can write a C decoder.

Great! Start by reading src/or/geoip.c and its header file on master, since that's where this change is likely to be merged to.

Also, I'm new to Tor development (thanks, Chuck Schumer!) -- how does one submit pull requests for the Tor repo? I'm only familiar with contributing to projects via GitHub.

Post the Git URL and branch name on the relevant Trac ticket, and set the ticket's status to needs_review. I include a link to Gitweb for my branches as well (see #3000 (moved) for an example), but that's optional.

Also see the [WikiFormatting] page for how to escape an URL so Trac will let us select it (to paste it into a terminal).

Also, please avoid committing large binaries into Git branches; we will want to translate the GeoIP database into binary form either during the usual make process or when building a tarball. (Yes, 200 kB is large for a binary file in Git.)

Trac:
Milestone: Tor: unspecified to Tor: 0.2.3.x-final

Throwing this into needs_review myself.

I'm not so sure that having this stuff in separate run-length and cc files will actually be needed; endianness issues will keep us from reading any portable file into an array-of-country verbatim, I think. Let's see how tricky the read code is before we decide that complexifying the format is worth it in order to make the read code simpler.

I'm also not clear how best to read this format quickly on the fly: unpacking it all into ram seems like a lose if we don't have to; a workable index format would be neat (and would be much easier for fixed-length or self-synchronizing records.

Trac:
Status: assigned to needs_review

Other thoughts: elsewhere we have used the country code "??" for an unassigned mapping.

We don't have a consistent internal uint8->country mapping in Tor: that needs to get saved too.

Right now the db has 246 country codes; it's be nice to have a good answer here for what we would do if we had to go over 255.

One possibility for making this format self-synchronizing is to reserve the run-length index 0 and the country code 0 such that neither corresponds to a run length or to a country. There are probably better ways too.

Another useful observation: there seem to be less than 32 run lengths that ever occur more than once. Not sure what we can use that for, though.

Replying to nickm:

I'm not so sure that having this stuff in separate run-length and cc files will actually be needed; endianness issues will keep us from reading any portable file into an array-of-country verbatim, I think.

The country codes are two-character ASCII strings, and are thus endianness-independent. The run lengths are integers, but could be encoded in big-endian form everywhere.

Let's see how tricky the read code is before we decide that complexifying the format is worth it in order to make the read code simpler.

I think putting each separate array in a separate file would give us a simpler format than putting all of the arrays in a single file.

I'm also not clear how best to read this format quickly on the fly: unpacking it all into ram seems like a lose if we don't have to; a workable index format would be neat (and would be much easier for fixed-length or self-synchronizing records.

We would need an index format even with fixed-length records -- each record corresponds to a wildly varying amount of IP address space. We look up the country associated with an IP address, so we need to read the database in order from some starting point for which we know both the starting IP address and the offset into the packed dataset. I suggest an index format consisting of (IP address, offset) pairs as fixed-length records; we can look up an IP address by performing a binary search through the index, then searching linearly through the runs in the piece of the packed dataset starting at the specified offset. Specifying the offset has the additional advantage that (if we know how to find the end of the index array) we can later put the packed dataset in the same file as (and following) the index.

Replying to nickm:

Other thoughts: elsewhere we have used the country code "??" for an unassigned mapping.

We don't have a consistent internal uint8->country mapping in Tor: that needs to get saved too.

Right now the db has 246 country codes; it's be nice to have a good answer here for what we would do if we had to go over 255.

We could use a variable-length integer (possibly encoded as one of a small number of escape values followed by a protobuf variable-length integer).

One possibility for making this format self-synchronizing is to reserve the run-length index 0 and the country code 0 such that neither corresponds to a run length or to a country. There are probably better ways too.

As I explained in my previous comment, we need an index format anyway, so making the run-list format self-synchronizing won't help us.

Another useful observation: there seem to be less than 32 run lengths that ever occur more than once. Not sure what we can use that for, though.

I suspect that we could get a space win by constraining run lengths to a small set such that every run length in the dataset can be expressed as a sum of encodable run lengths (possibly the powers of two, although we could automatically search for a better set) and Huffman-coding or arithmetic-coding the run records; since we have to scan each chunk of runs linearly anyway, this wouldn't make our speed much worse. I think the added code complexity (in both the encoder and decoder) makes fancy coding schemes a bad idea, though.

Replying to rransom:

Replying to nickm:

I'm not so sure that having this stuff in separate run-length and cc files will actually be needed; endianness issues will keep us from reading any portable file into an array-of-country verbatim, I think.

The country codes are two-character ASCII strings, and are thus endianness-independent. The run lengths are integers, but could be encoded in big-endian form everywhere.

I thought that the whole point of endian7000's idea was that a lot of the savings came from variable-length run-length encoding. In the database I'm looking at, there are 4212 distinct run-length encodings. Lots of the win comes from encoding the more frequent run-lengths as a single byte and the less frequent ones as two bytes.

To quantify: 136810 of the runs in my geoip file would have their lengths represented as one byte in the var-length encoding, whereas 11586 would take two bytes. Using a fixed-width two-byte encoding for run lengths would add another 133K to the file size.

Replying to nickm:

Replying to rransom:

Replying to nickm:

I'm not so sure that having this stuff in separate run-length and cc files will actually be needed; endianness issues will keep us from reading any portable file into an array-of-country verbatim, I think.

The country codes are two-character ASCII strings, and are thus endianness-independent. The run lengths are integers, but could be encoded in big-endian form everywhere.

I thought that the whole point of endian7000's idea was that a lot of the savings came from variable-length run-length encoding. In the database I'm looking at, there are 4212 distinct run-length encodings. Lots of the win comes from encoding the more frequent run-lengths as a single byte and the less frequent ones as two bytes.

To quantify: 136810 of the runs in my geoip file would have their lengths represented as one byte in the var-length encoding, whereas 11586 would take two bytes. Using a fixed-width two-byte encoding for run lengths would add another 133K to the file size.

The mapping of run-length codes to run lengths should be stored in a separate file, in which the run lengths are fixed-width big-endian integers, and each run-length code should be an index into that array. The mapping of country identifiers to two-character ISO country codes should be stored similarly. The list of runs should be stored in endian7000's variable-length-record format.


   735 country-codes__null-terminated
 16840 run-lengths__uint32be
306539 runs__uvarint

I'm up for implementing the index approach, but I probably won't be able to before this weekend.

Trac:
Username: endian7000

I did some quick tests on what to do if we ever need to handle over 256 country codes. Using a varint in the main data file to encode both run lengths and countries is less than 1% more expensive than using a single byte for countries, so I suggest that we future-proof here by using a varint for both.

I also tried a variant where we only provide special short codes for run-lengths that appear N or more times, and encode the others literally as (total_num_codes+actual_value). No value of N that I tried provided more than 4% savings, so it's probably not worth it.

I'm attaching a quick and dirty implementation of the format, with indices. (I rote it on a car trip, so I didn't have endian7000's implementation in front of me, and I wanted something to play with.) It sticks the tables and index in one file, and the data in another, though I'm still not sold on having two separate files -- if we do keep files separate, we should do something to keep people from accidentally getting thmem out of sync.

Also, this approach still needs C code. I'm going to suggest that we start by just adding this as another format that we read into our existing in-core format, and then only later add support for keeping the raw data mmap'd for low-RAM environments.

endian7000, are you still interested in working on that?

Trac:
geoip_compress.py

Yoinking this out of needs_review for now; we should throw it back in when there is more code to review.

Trac:
Status: needs_review to assigned

Trac:
Milestone: Tor: 0.2.3.x-final to Tor: 0.2.4.x-final

Trac:
Username: nils
Cc: N/A to nils@shkoo.com

I tried to implement geoip lookup using a binary decision diagram (BDD) rather than a binary search over ranges. With a BDD, you look at a bit of the address, and depending on whether it is a 0 or 1 you jump to another node in a dag to look at another bit, terminating when you get to a country code. In summary: it works, but doesn't save much space. (One should publish negative results, right?) Code is available from git clone http://www.bamsoftware.com/git/geoip-bdd.git.

The geoip database of June 6 2012 has 168366 ranges and is 4.0 MB on disk. A reduced BDD computing the same mapping (geoip-rdd.bdd) has 185170 nodes and is 3.6 MB on disk. A range could reasonably take 10 bytes (int32 ip_low, int32 ip_high, 2-byte country code) for a total of about 1.6 MB in memory; and a BDD node could reasonably fit in 8 bytes (int32 lo pointer and int32 hi pointer, stealing the high 5 bits of a node index for a bit number) for a total of about 1.4 MB in memory.

The size of a BDD depends on its variable ordering. I only tried looking at bits from most-to-least significant. My intuition says that should be close to the best ordering, because of how addresses are allocated. I didn't test any others, partly because ranges aren't contiguous using any other ordering, complicating the construction of the BDD.

Trac:
Cc: nils@shkoo.com to nils@shkoo.com, dcf@torproject.org

Replying to dcf:

I tried to implement geoip lookup using a binary decision diagram (BDD) rather than a binary search over ranges. With a BDD, you look at a bit of the address, and depending on whether it is a 0 or 1 you jump to another node in a dag to look at another bit, terminating when you get to a country code. In summary: it works, but doesn't save much space. (One should publish negative results, right?)

Yes.

The geoip database of June 6 2012 has 168366 ranges and is 4.0 MB on disk. A reduced BDD computing the same mapping (geoip-rdd.bdd) has 185170 nodes and is 3.6 MB on disk. A range could reasonably take 10 bytes (int32 ip_low, int32 ip_high, 2-byte country code) for a total of about 1.6 MB in memory; and a BDD node could reasonably fit in 8 bytes (int32 lo pointer and int32 hi pointer, stealing the high 5 bits of a node index for a bit number) for a total of about 1.4 MB in memory.

What if you use variable-size integers for the if-0 and if-1 offsets, and store them as offsets from (the end of) the current node?

The size of a BDD depends on its variable ordering. I only tried looking at bits from most-to-least significant. My intuition says that should be close to the best ordering, because of how addresses are allocated. I didn't test any others, partly because ranges aren't contiguous using any other ordering, complicating the construction of the BDD.

If you use a non-fixed ordering, you will probably waste space by having to store a bit index in every node.

(A BDD with fixed bit order is a ‘crit-bit tree’ or ‘PATRICIA tree’ (there are other names). See also [http://cr.yp.to/critbit.html].)

Trac:
Keywords: N/A deleted, tor-relay added

Trac:
Component: Tor Relay to Tor

Trac:
Milestone: Tor: 0.2.4.x-final to Tor: unspecified

Trac:
Reviewer: N/A to N/A
Keywords: tor-relay deleted, tor-relay memory disk download-size added
Severity: N/A to Normal
Sponsor: N/A to N/A

Trac:
Keywords: tor-relay memory disk download-size deleted, sponsor8-maybe, memory, download-size, disk, tor-relay added

To clarify, is this ticket mainly about a disk format, to save disk space for Tor installs? Or is it also about a memory format, to save process size and let Tor clients run on tinier platforms?

It can be for both.

Trac:
Keywords: N/A deleted, metrics-geoip added

mentioned in issue #6471 (moved)

moved to tpo/core/tor#2506 (closed)

Design and implement a more compact GeoIP file format

Child items 0

Activity