Whenever we look up relay IP addresses in GeoIP or AS databases covering more than a few months, we can't just use a single database. These databases change over time for a reason: an IP address that resolves to a certain country or AS today may have belonged to a different country or AS a year ago. What we should really do is use multiple databases and look up the IP address in the database that was most recent at the given time.
How about we design a file format for multiple GeoIP or AS databases and write a Python library for using it? The library should allow us to:
Convert a given GeoIP or AS database from MaxMind (or one in a similar format) to our format.
Merge a new single GeoIP or AS database into our format.
Look up the country code/name or AS number/name of an IP address on a given date.
I'd say that non-functional requirements are as follows, from most important to least important: lookup speed, lookup speed, lookup speed, memory consumption, file size, library code complexity.
Pyonioonoo had me distracted for a while; getting back to metrics stuff now.
A possible first step would be to figure out whether there is any info in MaxMind's DB that we don't need or use.
A naive solution would then be (a rough sketch follows this list):
Step 0) Remove unnecessary data
Step 1) Diff the old CSV against the new CSV
Step 2.1) Add a human-readable(?) line to the old CSV explaining the date of the change, the number of lines changed, and possibly other details that will become obvious once we actually try the diff
Step 2.2) Modify the diff to make it more parseable; since we know we are only diffing CSVs, I bet we can optimize this a bit
Step 3) Append the modified diff to the old CSV
Step 4) Write a library that can parse the added human-readable line and the modified diff
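Here's a rough sketch in Python of what steps 1 through 4 might look like, using difflib for the diffing; the function name, the header-line format, and the file handling are all made up, just to illustrate the idea:

    import difflib

    def append_diff(old_csv_path, new_csv_path, change_date):
        # Step 1: diff the old CSV against the new CSV.
        with open(old_csv_path) as old_file:
            old_lines = old_file.readlines()
        with open(new_csv_path) as new_file:
            new_lines = new_file.readlines()
        diff_lines = list(difflib.unified_diff(old_lines, new_lines, n=0))
        # Steps 2.1 and 3: append a header line describing the change, followed
        # by the (for now unmodified) diff, to the old CSV.
        with open(old_csv_path, 'a') as old_file:
            old_file.write('# change-date=%s lines-changed=%d\n'
                           % (change_date, len(diff_lines)))
            old_file.writelines(diff_lines)

Step 2.2 would then replace the plain unified diff with something more compact, and step 4's library would read the header lines and diffs back in.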
Another solution would be to go all out: write our own spec and a parser that converts every newly generated GeoIP DB into something that conforms to our spec (and write a library to parse such a file).
The second approach would be a lot more useful in the long run but a lot more time-consuming to write. If we pick either approach (or an alternative one), I'd be happy to write the Python code for it!
Your first approach above already sounds like a design for a file format, and I admit that the second approach requires a lot of work before seeing any results.
Hmm. How about a third approach: write a library that a) reads unmodified database files into memory, possibly together with a mapping file containing the dates when these files became valid, and b) resolves IP addresses and dates to country codes or ASes. We wouldn't keep the full file contents in memory, only the information relevant for looking up an IP address and date. We can still think about a compact file format for all this later on.
This third approach has the disadvantage that initializing the lookup library may take a while (tens of seconds, maybe minutes). But it reduces development time a lot at the beginning. Also, we may learn a lot about compact representations of address ranges, dates, country codes, and ASes, which we can use to design a good file format later on.
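A minimal sketch of that third approach, assuming dates are 'YYYYMMDD' strings and that each single-database object brings its own lookup method (all names here are made up):

    import bisect

    class MultiGeoipDatabase(object):
        """Keep several unmodified databases in memory, keyed by the date on
        which each became valid, and pick the right one per lookup."""

        def __init__(self):
            self.valid_from_dates = []  # sorted list of 'YYYYMMDD' strings
            self.databases = {}         # valid-from date -> database object

        def add_database(self, valid_from_date, database):
            bisect.insort(self.valid_from_dates, valid_from_date)
            self.databases[valid_from_date] = database

        def lookup(self, address, date):
            # Use the database that was most recent on the requested date.
            index = bisect.bisect_right(self.valid_from_dates, date) - 1
            if index < 0:
                return None  # no database had been published yet on that date
            return self.databases[self.valid_from_dates[index]].lookup(address)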
This means we'll have to store multiple GeoIP database files. Is this a good idea? Wouldn't it be better to have a single file?
Yes, it would be better to have a single file, but that requires more development work before we get there, and we won't see any results until everything is done. A solution with multiple files gives us quicker results and lets us gain experience with keeping multiple GeoIP databases in memory, which in turn may reduce the effort of coming up with a new single-file format.
At least that's how I'd approach the problem. If you like your approach better, please feel free to do that one instead.
This means doing the whole file-format dance twice, and I'm not sure that's a good idea. I'm also not sure what the next step is here :/
Not sure how I can help. I only described how I would approach this problem, but it's quite possible your approach turns out to be better. I suggest taking your preferred approach.
I may start hacking on this today or tomorrow. Grabbing the ticket. (If I haven't made any progress a week from now, I'll reassign it to the 'none' guy.)
Trac: Owner: N/A to karsten Status: new to assigned
#2506 (moved) has a link to possibly-useful Python code.
Maybe, yes. Interesting discussion there! The focus is somewhat different though. I'd sacrifice compact representation of the data structure holding multiple GeoIP databases for better lookup speed. We might store dozens of GeoIP databases in a single structure for metrics, and we'll do bulk lookups. If that takes tens of MiB on disk and a multiple of that in RAM, but lookups therefore finish within a few minutes rather than hours, I'm okay with that.
So, I produced an initial version of the multi-GeoIP database in Java. The unit tests I wrote all succeed, but there might still be edge cases that break. I'd much appreciate a review of the design, which is described in the comments. If the design turns out to be good, we'll want a Python implementation of this Java thing. (Such an implementation might benefit from the Python code in #2506 (moved).)
Trac: Status: assigned to needs_review Summary: Design file format and Python library for multiple GeoIP or AS databases to Design file format and Python/Java library for multiple GeoIP or AS databases
If you have enough RAM to burn on using your programming language's general-purpose data structures, and your disks are fast enough to load the data from text files, go for it. (General-purpose compression of the files on disk would be helpful if you're bottlenecked by disk IO.) But do keep the data structures in #2506 (moved) in mind in case you need to reduce the load on whatever hardware you're using.
In particular, a ‘crit-bit tree’ (see [ticket:2506#comment:20] and [ticket:2506#comment:19]) should provide very fast in-memory lookup, if you can implement it in a sufficiently fast language (i.e. not Python).
Here we go. I finished the initial version of a multi-GeoIP database in Java. It's now capable of handling five or more years of monthly databases, which is about as much as we need for metrics. And it's incredibly fast and memory-friendly! Some facts:
Looking up 100,000 IP address/date combinations, each in the database that best matches its date, takes 600 milliseconds overall.
Combining five years of data into a single database takes 46 seconds, but after combining them and storing the result to disk once, loading everything back to memory only takes 1.3 seconds.
The total memory consumption for five years of GeoIP data is under 150 MB.
Here's the output of a performance test that imports 60 monthly GeoIP databases and performs 100000 random lookups:
    Generating test cases... 52705 millis.
    Importing files... 45855 millis.
    Making test requests... 623 millis, 0 out of 100000 tests failed.
    Database contains 60 databases and 146629 combined address ranges.
    Performed 6042264 address range imports requiring 12886981 lookups.
    Performed 100000 address lookups requiring 114316 lookups.
    First 10 entries, in reverse order, are:
      223.255.255.0 223.255.255.255 au 20110901 20120901 12
      223.255.254.0 223.255.254.255 sg 20110501 20120901 16
      223.255.252.0 223.255.253.255 cn 20110501 20120901 16
      223.255.248.0 223.255.251.255 au 20100801 20120901 25
      223.255.244.0 223.255.247.255 in 20100901 20120901 24
      223.255.240.0 223.255.243.255 hk 20100901 20120901 24
      223.255.236.0 223.255.239.255 cn 20110401 20120901 17
      223.255.232.0 223.255.235.255 au 20100901 20120901 24
      223.255.224.0 223.255.231.255 id 20100901 20120901 24
      223.255.192.0 223.255.223.255 kr 20100901 20120901 24
    Saving combined databases to disk... 849 millis.
    Loading combined databases from disk... 1253 millis.
    Making a second round of test requests... 591 millis, 0 out of 100000 tests failed.
Maybe the next step is to rewrite this code in Python? There shouldn't be anything too Java-specific that couldn't be implemented in Python, I think. The Java code is available here, and I uploaded the 5*60 input files from the regional registries here and the combined database file here.
Unassigning this ticket from me, because I'm planning to ignore it for the next few weeks, and I don't want to block anybody else wanting to work on it.
Sathya, if you're interested in doing the Python part, feel free to grab the ticket. (If not, or if you're already overloaded with other stuff, that's fine, too.)
Trac: Owner: karsten to N/A Status: needs_review to assigned
I want to hack on this :( But I don't think I can right now. If anyone else wants to do it, please feel free to take it up.
Okay, I think I got this into a needs_review state. gsathya, atagar, if you don't mind, please review the Python part and help me turn this code into a simple metrics utility library.
Here's what the code does:
The Java code combines GeoIP or ASN databases and produces a single combined database file. That file has lines like the following:
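An illustrative line, reconstructed from the description in the next sentence (the exact column layout, including whether there is a trailing count column, is an assumption):

    223.255.254.0 223.255.254.255 AS55415 20110501 20121101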
The first line means that addresses 223.255.254.0 to 223.255.254.255 were assigned to AS55415 in the databases published from 20110501 to 20121101.
The Python code reads a combined database file and answers questions about which country or AS a given IP address was assigned to on a given date. It uses Python's lists and the bisect module to get anywhere close to Java's performance.
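For reference, here is a stripped-down sketch of such a bisect-based lookup; this is not the actual pygeodate code, and it assumes at most one entry per address range, whereas the real combined database keeps one entry per range and database interval:

    import bisect

    def address_to_long(address_string):
        # Turn a dotted-quad IPv4 address into a number for comparisons.
        octets = address_string.split('.')
        return long(''.join(["%02X" % long(octet) for octet in octets]), 16)

    class CombinedDatabase(object):
        def __init__(self):
            self.keys = []  # sorted numeric start addresses
            self.data = {}  # start address -> (end, code, first date, last date)

        def lookup(self, address_string, date):
            address = address_to_long(address_string)
            # Rightmost range that starts at or before the address.
            index = bisect.bisect_right(self.keys, address) - 1
            if index < 0:
                return None
            end, code, first_date, last_date = self.data[self.keys[index]]
            if address <= end and first_date <= date <= last_date:
                return code
            return None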
atagar mentions on IRC that OrderedDict might be an alternative to bisect. It's the equivalent of LinkedHashMap in Java. This does indeed look potentially more efficient.
I'm sure that you don't want to add a dependency just for this sort of functionality, but just an FYI that I have a few IP utilities that you might find helpful...
I haven't a clue what any of these acronyms mean. My understanding is that shortening function names to some arcane, overly cramped abbreviation is an artifact of old-time C development where saving every byte of space mattered. Mind coming up with more descriptive names?
return int(date_datetime.strftime('%s')) / 86400
It took me a second to figure out where the 86400 came from. You might want to comment that.
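For example, something along these lines (the constant and function names are just suggestions):

    SECONDS_PER_DAY = 60 * 60 * 24  # 86400

    def date_to_days(date_datetime):
        # strftime('%s') yields seconds since the Unix epoch (a platform-specific
        # extension), so dividing by seconds-per-day gives whole days since the epoch.
        return int(date_datetime.strftime('%s')) / SECONDS_PER_DAY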
for line in input_file.readlines():
File objects themselves are iterable over the lines. I suspect that calling readlines() here is creating a defensive copy with a list of lines, so this causes you to read the file into memory twice (both for this list and what we add while processing it).
    >>> with open('/tmp/foo') as input_file:
    ...     for line in input_file:
    ...         print line.strip()
    ...
    pepperjack is
    very tasty cheese!
You might want to rename this to _add_range, since you don't want anyone using this method directly (pygeodate is a library) -- calling it will leave self.data unsorted and self.keys missing a key. Or, to be safer, just remove the function, since it's only two lines.
    def load_combined_databases(self, path):
        with open(path) as input_file:
            for line in input_file.readlines():
                line = line.strip()
                if line.startswith('!'):
                    self.add_date(line)
                    continue
                else:
                    self.add_range(line)
I think I just made all changes you suggested, but I'm having trouble with the following:
for line in input_file.readlines():
File objects themselves are iterable over the lines. I suspect that calling readlines() here is creating a defensive copy with a list of lines, so this causes you to read the file into memory twice (both for this list and what we add while processing it).
Are you certain that the current code leads to reading the file into memory twice?
I just removed the two functions add_date and add_range, and I got rid of the continue. Thanks!
How do we proceed with turning this simple Python script into a library that can be used by #6232 (moved)? In theory, we could start by simply copying the parts we need over, but maybe there's a smarter way.
octets = address_string.split('.')
return long(''.join(["%02X" % long(octet) for octet in octets]), 16)
Nice. :)
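For reference, this packs the four octets of a dotted quad into a single 32-bit value; a quick interactive check:

    >>> octets = '223.255.254.0'.split('.')
    >>> long(''.join(["%02X" % long(octet) for octet in octets]), 16)
    3758095872L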
Are you certain that the current code leads to reading the file into memory twice?
Pretty sure. What makes you think that it doesn't?
Calling 'file.readlines()' reads the whole file into a list of newline-separated strings. At this point you've read the whole file into memory once, and then you iterate over the entries and append data for each one.
This is a bit similar to the difference between python's range() and xrange() function. Calling range() gives you a list (which is iterable) while xrange() gives you an iterator. Hence...
for i in range(1000000000): print i
... means making a list of a billion ints in memory then printing each while...
for i in xrange(1000000000): print i
... has constant memory usage because xrange() provides the sequence on demand.
Personally I think that it's stupid that the python compiler isn't smart enough to say "the developer's using range() or readlines() in a loop, hence provide an iterator rather than a list", and maybe it does in newer python versions. I wouldn't count on it though.
I just removed the two functions add_date and add_range, and I got rid of the continue. Thanks!
Great!
How do we proceed with turning this simple Python script into a library that can be used by #6232 (moved)? In theory, we could start by simply copying the parts we need over, but maybe there's a smarter way.
Why not just import pygeodate? No need to copy parts. We can remove pygeoip as well.
Are you certain that the current code leads to reading the file into memory twice?
Pretty sure. What makes you think that it doesn't?
For some reason, your earlier suggestion broke for me, so I thought my approach was how it was supposed to be. Turns out that was something else, and everything works fine now. Changed. Thanks!
How do we proceed with turning this simple Python script into a library that can be used by #6232 (moved)? In theory, we could start by simply copying the parts we need over, but maybe there's a smarter way.
Why not just import pygeodate? No need to copy parts. We can remove pygeoip as well.
This is mostly a question of where the code is supposed to live, and how to make it more library-like. The idea of the metrics-tasks.git repository is to keep code of ongoing or completed projects in case we need it for future projects. Nothing else should use code from metrics-tasks.
I wonder if we should move the task-6471/ directory to metrics-utils.git as pygeodate/. What do you think about that? And what is the directory layout for a Python library supposed to look like?
Trac: Status: needs_review to accepted Owner: N/A to karsten
How about making a new pygeodate.git? We could just clone that and use it whenever we need it. metrics-utils.git would also be fine, but then we'd have to clone it and delete the rest of the stuff in metrics-utils. pygeodate/pygeodate.py would be okay; a top-level pygeodate.py would be better. The former means doing import pygeodate.pygeodate as pygeodate, while the latter is just import pygeodate.
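To spell out the two import styles being compared (assuming the repository root ends up on the Python path; the package variant would also need a pygeodate/__init__.py):

    # pygeodate/pygeodate.py, i.e. a module inside a package:
    import pygeodate.pygeodate as pygeodate

    # pygeodate.py at the repository root, i.e. a single top-level module:
    import pygeodate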
I'd rather avoid creating yet another repository for this tool. I'm currently working on blockfinder together with Jake, where I want to include support for multiple GeoIP or AS databases. I plan to reuse some of the pygeodate.py code for that. This is going to take another month or two, so I suggest we copy pygeodate.py to wherever we need it, including subdirectories of metrics-tasks.git. But moving the multi-database code to blockfinder is the better solution in the long term, because it's more likely to see maintenance updates if it's part of a tool that we use for other purposes, too. I'm closing this ticket, because the task is done.
Trac: Status: accepted to closed Resolution: N/A to implemented