Improve bulk imports of descriptor archives
We need to improve bulk imports of descriptor archives. Whenever somebody wants to initialize Onionoo with existing data, they'll need to process years of descriptors. The current code is not at all optimized for that; it's designed for running once per hour and updating things as quickly as possible. Let's fix that and support bulk imports better.
Here's what we should do:
- We define a new directory in/archive/ where operators can put descriptor archives fetched from CollecTor. Whenever there are files in that directory, we import them first (before descriptors in in/recent/). In particular, we iterate over files twice: in the first iteration we look at the first contained descriptor to determine its type, and in the second iteration we parse files containing server descriptors and then files containing other descriptors. (This order is important for computing advertised bandwidth fractions, which only works if we parse server descriptors before consensuses.) This process will take a very long time, so we should log whenever we complete a tarball, and ideally we'd print out how many tarballs we have already parsed and how many more we need to parse.
- We add a new command-line switch --update-only for only updating status files and not downloading descriptors or writing document files. Operators could then import archives, which would take days or even weeks, and then switch to downloading and processing recent descriptors. My branch task-12651-2 is a major improvement here, because it ensures that all documents will be written once the bulk import is done, not just the ones for relays and bridges that were contained in recent descriptors. Future command-line options would be --download-only and --write-only for the other two phases, and --single-run, which does what is currently the default and which we'd need once we switch from being called by cron every hour to scheduling our own hourly runs internally.
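To make the two-pass idea concrete, here is a minimal sketch of the ordering and progress-logging logic. It is not existing Onionoo code: the class ArchiveImportPlanner, its method names, and the file names are hypothetical, and it assumes that the first line of each file is an "@type" annotation naming the descriptor type. Reading that first line out of a tarball is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for planning a bulk import; not part of Onionoo. */
final class ArchiveImportPlanner {

  /**
   * Returns true if the given first line announces server descriptors.
   * Assumption: files start with an "@type ..." annotation line.
   */
  static boolean isServerDescriptor(String firstLine) {
    return firstLine != null
        && firstLine.startsWith("@type ")
        && firstLine.contains("server-descriptor");
  }

  /**
   * First pass: classify files by their first line. The returned list puts
   * files with server descriptors first, then all other files, keeping the
   * map's iteration order within each group.
   */
  static List<String> planOrder(Map<String, String> firstLineByFile) {
    List<String> ordered = new ArrayList<>();
    List<String> others = new ArrayList<>();
    for (Map.Entry<String, String> entry : firstLineByFile.entrySet()) {
      if (isServerDescriptor(entry.getValue())) {
        ordered.add(entry.getKey());
      } else {
        others.add(entry.getKey());
      }
    }
    ordered.addAll(others);
    return ordered;
  }

  /** Builds the progress message to log after completing one tarball. */
  static String progressLine(int done, int total, String fileName) {
    return String.format("Parsed %d of %d tarballs (%d remaining): %s",
        done, total, total - done, fileName);
  }
}
```

The second pass would then walk the returned list in order, parse each file, and log progressLine(...) after each one. Passing a LinkedHashMap keeps the original file order stable within each group.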
I somewhat expect us to run into memory problems when importing months or even years of data at once. So, part of the challenge here will be to keep an eye on memory usage and fix any memory issues.
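One cheap way to keep an eye on memory during a long import would be to log heap usage after each tarball. The following is only a sketch under that assumption: the class MemoryLog and its methods are hypothetical, and it relies solely on the standard java.lang.Runtime methods.

```java
/** Hypothetical helper for logging heap usage; not part of Onionoo. */
final class MemoryLog {

  private static final long MIB = 1024L * 1024L;

  /** Heap bytes currently in use by the JVM (allocated minus free). */
  static long usedBytes() {
    Runtime runtime = Runtime.getRuntime();
    return runtime.totalMemory() - runtime.freeMemory();
  }

  /** Formats used and maximum heap size in MiB plus a percentage. */
  static String describe(long usedBytes, long maxBytes) {
    long percent = maxBytes > 0 ? usedBytes * 100 / maxBytes : 0;
    return String.format("Heap used: %d MiB of %d MiB (%d%%)",
        usedBytes / MIB, maxBytes / MIB, percent);
  }

  /** Convenience method: describes the current state of this JVM. */
  static String snapshot() {
    return describe(usedBytes(), Runtime.getRuntime().maxMemory());
  }
}
```

Logging MemoryLog.snapshot() next to each per-tarball progress message would show whether usage keeps growing across the import or levels off after garbage collection.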