Whenever we look up relay IP addresses in GeoIP or AS databases covering more than a few months, we can't just use a single database. These databases change over time for a reason: an IP address that resolves to a certain country or AS today may have belonged to a different country or AS a year ago. What we should really do is use multiple databases and look up the IP address in the database that was most recent at the given time.
How about we design a file format for multiple GeoIP or AS databases and write a Python library for using it? The library should allow us to:
Convert a given GeoIP or AS database from MaxMind (or one in a similar format) to our format.
Merge a new single GeoIP or AS database into our format.
Look up the country code/name or AS number/name of an IP address on a given date.
I'd say that non-functional requirements are as follows, from most important to least important: lookup speed, lookup speed, lookup speed, memory consumption, file size, library code complexity.
Pyonioonoo had me distracted for a while; getting back to metrics stuff now.
A possible first step would be to figure out whether there is any info in MaxMind's DB that we don't need or use.
A naive solution would then be (a rough sketch follows this list):
Step 0) Remove unnecessary data
Step 1) Diff the old CSV against the new CSV
Step 2.1) Add a human-readable(?) line to the old CSV explaining the date of the change, the number of lines changed, and possibly other details that will become obvious once we actually try the diff
Step 2.2) Modify the diff to make it more parseable; since we know we are only diffing CSVs, I bet we can optimize this a bit
Step 3) Append the modified diff to the old CSV
Step 4) Write a library that can parse the added human-readable line and the modified diff
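Here's a rough sketch in Python of what steps 1 through 4 might look like, using difflib for the diffing; the function name, the header-line format, and the file handling are all made up, just to illustrate the idea:

    import difflib

    def append_diff(old_csv_path, new_csv_path, change_date):
        # Step 1: diff the old CSV against the new CSV.
        with open(old_csv_path) as old_file:
            old_lines = old_file.readlines()
        with open(new_csv_path) as new_file:
            new_lines = new_file.readlines()
        diff_lines = list(difflib.unified_diff(old_lines, new_lines, n=0))
        # Steps 2.1 and 3: append a header line describing the change, followed
        # by the (for now unmodified) diff, to the old CSV.
        with open(old_csv_path, 'a') as old_file:
            old_file.write('# change-date=%s lines-changed=%d\n'
                           % (change_date, len(diff_lines)))
            old_file.writelines(diff_lines)

Step 2.2 would then replace the plain unified diff with something more compact, and step 4's library would read the header lines and diffs back in.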
Another solution would be to go all out: write our own spec and a parser that converts every newly generated GeoIP DB into something that conforms to our spec (and write a library to parse such a file).
The second approach would be a lot more useful in the long run but a lot more time-consuming to write. If we pick either approach (or an alternative one), I'd be happy to write the Python code for it!
Your first approach above already sounds like a design for a file format, and I admit that the second approach requires a lot of work before seeing any results.
Hmm. How about a third approach: write a library that a) reads unmodified database files into memory, possibly together with a mapping file containing the dates when these files became valid, and b) resolves IP addresses and dates to country codes or ASes. We wouldn't keep the full file contents in memory, only the information relevant for looking up an IP address and date. We can still think about a compact file format for all this later on.
This third approach has the disadvantage that initializing the lookup library may take a while (tens of seconds, maybe minutes). But it reduces development time a lot at the beginning. Also, we may learn a lot about compact representations of address ranges, dates, country codes, and ASes, which we can use to design a good file format later on.
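A minimal sketch of that third approach, assuming dates are 'YYYYMMDD' strings and that each single-database object brings its own lookup method (all names here are made up):

    import bisect

    class MultiGeoipDatabase(object):
        """Keep several unmodified databases in memory, keyed by the date on
        which each became valid, and pick the right one per lookup."""

        def __init__(self):
            self.valid_from_dates = []  # sorted list of 'YYYYMMDD' strings
            self.databases = {}         # valid-from date -> database object

        def add_database(self, valid_from_date, database):
            bisect.insort(self.valid_from_dates, valid_from_date)
            self.databases[valid_from_date] = database

        def lookup(self, address, date):
            # Use the database that was most recent on the requested date.
            index = bisect.bisect_right(self.valid_from_dates, date) - 1
            if index < 0:
                return None  # no database had been published yet on that date
            return self.databases[self.valid_from_dates[index]].lookup(address)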
This means we'll have to store multiple GeoIP database files. Is this a good idea? Wouldn't it be better to have a single file?
Yes, it would be better to have a single file, but that requires more development work before we get there, and we won't see any results until everything is done. A solution with multiple files gives us quicker results and lets us gain experience with keeping multiple GeoIP databases in memory, which in turn may reduce the effort of coming up with a new single-file format.
At least that's how I'd approach the problem. If you like your approach better, please feel free to do that one instead.
This means doing the whole file-format dance twice, and I'm not sure that's a good idea. I'm also not sure what the next step is here :/
Not sure how I can help. I only described how I would approach this problem, but it's quite possible your approach turns out to be better. I suggest taking your preferred approach.
I may start hacking on this today or tomorrow. Grabbing the ticket. (If I haven't made any progress a week from now, I'll reassign it to the 'none' guy.)
Trac: Owner: N/A to karsten Status: new to assigned
#2506 (moved) has a link to possibly-useful Python code.
Maybe, yes. Interesting discussion there! The focus is somewhat different though. I'd sacrifice compact representation of the data structure holding multiple GeoIP databases for better lookup speed. We might store dozens of GeoIP databases in a single structure for metrics, and we'll do bulk lookups. If that takes tens of MiB on disk and a multiple of that in RAM, but lookups therefore finish within a few minutes rather than hours, I'm okay with that.
So, I produced an initial version of the multi-GeoIP database in Java. The unit tests I wrote all succeed, but there might still be edge cases that break. I'd much appreciate a review of the design, which is described in the comments. If the design turns out to be good, we'll want a Python implementation of this Java thing. (Such an implementation might benefit from the Python code in #2506 (moved).)
Trac: Status: assigned to needs_review Summary: Design file format and Python library for multiple GeoIP or AS databases to Design file format and Python/Java library for multiple GeoIP or AS databases
If you have enough RAM to burn on using your programming language's general-purpose data structures, and your disks are fast enough to load the data from text files, go for it. (General-purpose compression of the files on disk would be helpful if you're bottlenecked by disk IO.) But do keep the data structures in #2506 (moved) in mind in case you need to reduce the load on whatever hardware you're using.
In particular, a ‘crit-bit tree’ (see [ticket:2506#comment:20] and [ticket:2506#comment:19]) should provide very fast in-memory lookup, if you can implement it in a sufficiently fast language (i.e. not Python).
Here we go. I finished the initial version of a multi-GeoIP database in Java. It's now capable of handling five or more years of monthly databases, which is about as much as we need for metrics. And it's incredibly fast and memory-friendly! Some facts:
Looking up 100,000 IP address/date combinations, each in the database that best matches its date, takes 600 milliseconds overall.
Combining five years of data into a single database takes 46 seconds, but after combining them and storing the result to disk once, loading everything back to memory only takes 1.3 seconds.
The total memory consumption for five years of GeoIP data is under 150 MB.
Here's the output of a performance test that imports 60 monthly GeoIP databases and performs 100000 random lookups:
    Generating test cases... 52705 millis.
    Importing files... 45855 millis.
    Making test requests... 623 millis, 0 out of 100000 tests failed.
    Database contains 60 databases and 146629 combined address ranges.
    Performed 6042264 address range imports requiring 12886981 lookups.
    Performed 100000 address lookups requiring 114316 lookups.
    First 10 entries, in reverse order, are:
      223.255.255.0 223.255.255.255 au 20110901 20120901 12
      223.255.254.0 223.255.254.255 sg 20110501 20120901 16
      223.255.252.0 223.255.253.255 cn 20110501 20120901 16
      223.255.248.0 223.255.251.255 au 20100801 20120901 25
      223.255.244.0 223.255.247.255 in 20100901 20120901 24
      223.255.240.0 223.255.243.255 hk 20100901 20120901 24
      223.255.236.0 223.255.239.255 cn 20110401 20120901 17
      223.255.232.0 223.255.235.255 au 20100901 20120901 24
      223.255.224.0 223.255.231.255 id 20100901 20120901 24
      223.255.192.0 223.255.223.255 kr 20100901 20120901 24
    Saving combined databases to disk... 849 millis.
    Loading combined databases from disk... 1253 millis.
    Making a second round of test requests... 591 millis, 0 out of 100000 tests failed.
Maybe the next step is to rewrite this code in Python? There shouldn't be anything too Java-specific that couldn't be implemented in Python, I think. The Java code is available here, and I uploaded the 5*60 input files from the regional registries here and the combined database file here.
Unassigning this ticket from me, because I'm planning to ignore it for the next few weeks, and I don't want to block anybody else wanting to work on it.
Sathya, if you're interested in doing the Python part, feel free to grab the ticket. (If not, or if you're already overloaded with other stuff, that's fine, too.)
Trac: Owner: karsten to N/A Status: needs_review to assigned
I want to hack on this :( But I don't think I can right now. If anyone else wants to do it, please feel free to take it up.
Okay, I think I got this into a needs_review state. gsathya, atagar, if you don't mind, please review the Python part and help me turn this code into a simple metrics utility library.
Here's what the code does:
The Java code combines GeoIP or ASN databases and produces a single combined database file. That file has lines like the following:
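An illustrative line, reconstructed from the description in the next sentence (the exact column layout, including whether there is a trailing count column, is an assumption):

    223.255.254.0 223.255.254.255 AS55415 20110501 20121101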
The first line means that addresses 223.255.254.0 to 223.255.254.255 were assigned to AS55415 in the databases published from 20110501 to 20121101.
The Python code reads a combined database file and answers questions about which country or AS a given IP address was assigned to on a given date. It uses Python's lists and the bisect module to get anywhere close to Java's performance.
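For reference, here is a stripped-down sketch of such a bisect-based lookup; this is not the actual pygeodate code, and it assumes at most one entry per address range, whereas the real combined database keeps one entry per range and database interval:

    import bisect

    def address_to_long(address_string):
        # Turn a dotted-quad IPv4 address into a number for comparisons.
        octets = address_string.split('.')
        return long(''.join(["%02X" % long(octet) for octet in octets]), 16)

    class CombinedDatabase(object):
        def __init__(self):
            self.keys = []  # sorted numeric start addresses
            self.data = {}  # start address -> (end, code, first date, last date)

        def lookup(self, address_string, date):
            address = address_to_long(address_string)
            # Rightmost range that starts at or before the address.
            index = bisect.bisect_right(self.keys, address) - 1
            if index < 0:
                return None
            end, code, first_date, last_date = self.data[self.keys[index]]
            if address <= end and first_date <= date <= last_date:
                return code
            return None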
atagar mentions on IRC that OrderedDict might be an alternative to bisect. It's the equivalent of LinkedHashMap in Java. This does indeed look potentially more efficient.
I'm sure that you don't want to add a dependency just for this sort of functionality, but just an FYI that I have a few IP utilities that you might find helpful...
I haven't a clue what any of these acronyms mean. My understanding is that shortening function names to some arcane, overly cramped abbreviation is an artifact of old-time C development where saving every byte of space mattered. Mind coming up with more descriptive names?
return int(date_datetime.strftime('%s')) / 86400
It took me a second to figure out where the 86400 came from. You might want to comment that.
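For example, something along these lines (the constant and function names are just suggestions):

    SECONDS_PER_DAY = 60 * 60 * 24  # 86400

    def date_to_days(date_datetime):
        # strftime('%s') yields seconds since the Unix epoch (a platform-specific
        # extension), so dividing by seconds-per-day gives whole days since the epoch.
        return int(date_datetime.strftime('%s')) / SECONDS_PER_DAY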
for line in input_file.readlines():
File objects themselves are iterable over the lines. I suspect that calling readlines() here is creating a defensive copy with a list of lines, so this causes you to read the file into memory twice (both for this list and what we add while processing it).
    >>> with open('/tmp/foo') as input_file:
    ...     for line in input_file:
    ...         print line.strip()
    ...
    pepperjack is
    very tasty cheese!
You might want to rename this to _add_range, since you don't want anyone using this method directly (pygeodate is a library) -- calling it will leave self.data unsorted and self.keys missing a key. Or, to be safer, just remove the function, since it's only two lines.
    def load_combined_databases(self, path):
        with open(path) as input_file:
            for line in input_file.readlines():
                line = line.strip()
                if line.startswith('!'):
                    self.add_date(line)
                    continue
                else:
                    self.add_range(line)
I think I just made all changes you suggested, but I'm having trouble with the following:
for line in input_file.readlines():
File objects themselves are iterable over the lines. I suspect that calling readlines() here is creating a defensive copy with a list of lines, so this causes you to read the file into memory twice (both for this list and what we add while processing it).
Are you certain that the current code leads to reading the file into memory twice?
I just removed the two functions add_date and add_range, and I got rid of the continue. Thanks!
How do we proceed with turning this simple Python script into a library that can be used by #6232 (moved)? In theory, we could start by simply copying the parts we need over, but maybe there's a smarter way.
octets = address_string.split('.')
return long(''.join(["%02X" % long(octet) for octet in octets]), 16)
Nice. :)
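For reference, this packs the four octets of a dotted quad into a single 32-bit value; a quick interactive check:

    >>> octets = '223.255.254.0'.split('.')
    >>> long(''.join(["%02X" % long(octet) for octet in octets]), 16)
    3758095872L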
Are you certain that the current code leads to reading the file into memory twice?
Pretty sure. What makes you think that it doesn't?
Calling 'file.readlines()' reads the whole file into a list of newline-separated strings. At this point you've read the whole file into memory once, and then you iterate over the entries and append data for each one.
This is a bit similar to the difference between python's range() and xrange() function. Calling range() gives you a list (which is iterable) while xrange() gives you an iterator. Hence...
for i in range(1000000000): print i
... means making a list of a billion ints in memory then printing each while...
for i in xrange(1000000000): print i
... has constant memory usage because xrange() provides the sequence on demand.
Personally I think that it's stupid that the python compiler isn't smart enough to say "the developer's using range() or readlines() in a loop, hence provide an iterator rather than a list", and maybe it does in newer python versions. I wouldn't count on it though.
I just removed the two functions add_date and add_range, and I got rid of the continue. Thanks!
Great!
How do we proceed with turning this simple Python script into a library that can be used by #6232 (moved)? In theory, we could start by simply copying the parts we need over, but maybe there's a smarter way.
Why not just import pygeodate? No need to copy parts. We can remove pygeoip as well.
Are you certain that the current code leads to reading the file into memory twice?
Pretty sure. What makes you think that it doesn't?
For some reason, your earlier suggestion broke for me, so I thought my approach was how it was supposed to be. Turns out that was something else, and everything works fine now. Changed. Thanks!
How do we proceed with turning this simple Python script into a library that can be used by #6232 (moved)? In theory, we could start by simply copying the parts we need over, but maybe there's a smarter way.
Why not just import pygeodate? No need to copy parts. We can remove pygeoip as well.
This is mostly a question of where the code is supposed to live, and how to make it more library-like. The idea of the metrics-tasks.git repository is to keep code of ongoing or completed projects in case we need it for future projects. Nothing else should use code from metrics-tasks.
I wonder if we should move the task-6471/ directory to metrics-utils.git as pygeodate/. What do you think about that? And what is the directory layout for a Python library supposed to look like?
Trac: Status: needs_review to accepted Owner: N/A to karsten
How about making a new pygeodate.git? We could just clone that and use it whenever we need it. metrics-utils.git would also be fine, but then we'd have to clone it and delete the rest of the stuff in metrics-utils. pygeodate/pygeodate.py would be okay; a top-level pygeodate.py would be better. The former means doing import pygeodate.pygeodate as pygeodate, while the latter is just import pygeodate.
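To spell out the two import styles being compared (assuming the repository root ends up on the Python path; the package variant would also need a pygeodate/__init__.py):

    # pygeodate/pygeodate.py, i.e. a module inside a package:
    import pygeodate.pygeodate as pygeodate

    # pygeodate.py at the repository root, i.e. a single top-level module:
    import pygeodate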
I'd rather avoid creating yet another repository for this tool. I'm currently working on blockfinder together with Jake, where I want to include support for multiple GeoIP or AS databases. I plan to reuse some of the pygeodate.py code for that. This is going to take another month or two, so I suggest we copy pygeodate.py to wherever we need it, including subdirectories of metrics-tasks.git. But moving the multi-database code to blockfinder is the better solution in the long term, because it's more likely to see maintenance updates if it's part of a tool that we use for other purposes, too. I'm closing this ticket, because the task is done.
Trac: Status: accepted to closed Resolution: N/A to implemented