We need to improve bulk imports of descriptor archives. Whenever somebody wants to initialize Onionoo with existing data, they'll need to process years of descriptors. The current code is not at all optimized for that; rather, it's designed to run once per hour and update things as quickly as possible. Let's fix that and support bulk imports better.
Here's what we should do:
We define a new directory in/archive/ where operators can put descriptor archives fetched from CollecTor. Whenever there are files in that directory, we import them first (before descriptors in in/recent/). In particular, we iterate over files twice: in the first iteration we look at the first contained descriptor to determine the file's type, and in the second iteration we parse files containing server descriptors and then files containing other descriptors. (This order is important for computing advertised bandwidth fractions, which only works if we parse server descriptors before consensuses.) This process will take very long, so we should log whenever we complete a tarball, and ideally we'd print how many tarballs we have already parsed and how many are left; see the sketch after these steps.
We add a new command-line switch --update-only for only updating status files, without downloading descriptors or writing document files. Operators could then import archives, which could take days or even weeks, and afterwards switch to downloading and processing recent descriptors. My branch task-12651-2 is a major improvement here, because it ensures that all documents are written once the bulk import is done, not just those for relays and bridges contained in recent descriptors. Future command-line options would be --download-only and --write-only for the other two phases, plus --single-run, which does what the current default does but only once, for when we switch from being called by cron every hour to scheduling our own hourly runs internally.
I somewhat expect us to run into memory problems when importing months or even years of data at once. So, part of the challenge here will be to keep an eye on memory usage and fix any memory issues.
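To illustrate the ordering described in the first step above, here is a minimal Java sketch of the two-pass import. Nothing here is Onionoo's actual code: peekDescriptorType() and parseTarball() are hypothetical placeholders for whatever metrics-lib-based code ends up doing the real work, and the point is only the ordering and per-tarball progress logging.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ArchiveImportSketch {

  enum TarballType { SERVER_DESCRIPTORS, OTHER }

  public static void importArchives(File archiveDir) {
    File[] tarballs = archiveDir.listFiles();
    if (tarballs == null || tarballs.length == 0) {
      return;
    }

    /* First pass: decide which tarballs contain server descriptors. */
    List<File> serverDescriptorTarballs = new ArrayList<>();
    List<File> otherTarballs = new ArrayList<>();
    for (File tarball : tarballs) {
      if (peekDescriptorType(tarball) == TarballType.SERVER_DESCRIPTORS) {
        serverDescriptorTarballs.add(tarball);
      } else {
        otherTarballs.add(tarball);
      }
    }

    /* Second pass: parse server descriptors before consensuses etc.,
     * so that advertised bandwidth fractions can be computed. */
    List<File> ordered = new ArrayList<>(serverDescriptorTarballs);
    ordered.addAll(otherTarballs);
    int total = ordered.size();
    int done = 0;
    for (File tarball : ordered) {
      parseTarball(tarball);
      done++;
      System.out.printf("Parsed %s (%d of %d tarballs).%n",
          tarball.getName(), done, total);
    }
  }

  /* Hypothetical stand-in: a real implementation would look at the
   * first descriptor contained in the tarball, not at its name. */
  private static TarballType peekDescriptorType(File tarball) {
    return tarball.getName().contains("server-descriptors")
        ? TarballType.SERVER_DESCRIPTORS : TarballType.OTHER;
  }

  /* Hypothetical stand-in for handing the tarball to the parser. */
  private static void parseTarball(File tarball) {
    /* ... */
  }
}
```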
To make the command-line args easier to program, it might be helpful to use commons-configuration (1.10 is available in wheezy); see the commons-configuration 1.10 user guide for documentation. It would also be nice to have a configuration file in addition, which sort of documents the execution, and, of course, to log the given configuration. A configuration file would also avoid longish command lines.
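A minimal sketch of what that could look like with commons-configuration 1.10, combining a properties file with command-line overrides. The file name onionoo.properties and the key names are made up for this illustration; only the commons-configuration classes themselves (PropertiesConfiguration, BaseConfiguration, CompositeConfiguration) are real.

```java
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.CompositeConfiguration;
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;

public class ConfigSketch {

  public static void main(String[] args) throws ConfigurationException {
    /* Command-line switches like --update-only override the file. */
    BaseConfiguration commandLine = new BaseConfiguration();
    for (String arg : args) {
      if ("--update-only".equals(arg)) {
        commandLine.setProperty("mode", "update-only");
      } else if ("--single-run".equals(arg)) {
        commandLine.setProperty("mode", "single-run");
      }
    }

    /* A configuration file documents the execution and avoids long
     * command lines; the file name is made up for this sketch. */
    PropertiesConfiguration file =
        new PropertiesConfiguration("onionoo.properties");

    /* Configurations added first take precedence, so command-line
     * switches override values from the file. */
    CompositeConfiguration config = new CompositeConfiguration();
    config.addConfiguration(commandLine);
    config.addConfiguration(file);

    /* Log the effective configuration. */
    System.out.println("mode = " + config.getString("mode", "hourly"));
  }
}
```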
A different aspect from the concurrent-run task (#13003 (moved)): I added an example of in-Java scheduling without crontab (testing it currently; execution seems fine). Maybe that functionality could be integrated as another option? It doesn't change a lot and is very useful where no crontab-like program is available due to permissions or the operating system. (The attached file is just an example, so there are general imports and the formatting is sloppy. ;-)
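For reference, a minimal sketch of that kind of in-process scheduling using the standard library's ScheduledExecutorService. This only stands in for the attached example; runHourlyUpdate() is a hypothetical placeholder for Onionoo's actual hourly steps.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HourlySchedulerSketch {

  public static void main(String[] args) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    /* Run immediately, then once per hour, without relying on cron. */
    scheduler.scheduleAtFixedRate(
        HourlySchedulerSketch::runHourlyUpdate, 0, 1, TimeUnit.HOURS);
  }

  /* Hypothetical placeholder for the download/update/write steps. */
  private static void runHourlyUpdate() {
    System.out.println("Running hourly update at "
        + java.time.Instant.now());
  }
}
```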
Noticing the recent tor-dev discussion, I wanted to add here that my Onionoo mirror has been running fine and is ready to go public, except for the import of the legacy data. I am waiting for this bulk import issue ...
Ah, great. Let me move this issue forward then. I implemented something today that's very close to the description above in branch task-13600 in my public repository. I did some local testing, but I'd recommend doing some more testing on your own and making a backup of your status/ and out/ directories before deploying that branch. Let me know if you're running into any issues there. Thanks!
There's some documentation in one of the commit messages:
New modes are:
--single-run     Run steps 1-3 only for a single time, then exit.
--download-only  Only run step 1: download recent descriptors, then exit.
--update-only    Only run step 2: update internal status files, then exit.
--write-only     Only run step 3: write output document files, then exit.
Default mode is:
[no argument]    Run steps 1-3 repeatedly once per hour.
But please also take a look at the code before running it. Thanks!
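For readers skimming this ticket, a hypothetical sketch of how these modes map onto the three steps. This is not the code in the branch; the step methods are empty placeholders.

```java
public class ModeDispatchSketch {

  public static void run(String mode) throws InterruptedException {
    switch (mode) {
      case "--download-only":
        downloadRecentDescriptors();          /* step 1 only */
        break;
      case "--update-only":
        updateInternalStatusFiles();          /* step 2 only */
        break;
      case "--write-only":
        writeOutputDocumentFiles();           /* step 3 only */
        break;
      case "--single-run":
        runAllSteps();                        /* steps 1-3 once */
        break;
      default:
        while (true) {                        /* steps 1-3 hourly */
          runAllSteps();
          Thread.sleep(60L * 60L * 1000L);
        }
    }
  }

  private static void runAllSteps() {
    downloadRecentDescriptors();
    updateInternalStatusFiles();
    writeOutputDocumentFiles();
  }

  /* Placeholders for the actual steps. */
  private static void downloadRecentDescriptors() {}
  private static void updateInternalStatusFiles() {}
  private static void writeOutputDocumentFiles() {}
}
```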
The metrics-lib in this branch isn't up to date. When I ran some tests today using submodule init and update, it was built with source and target 1.5, which causes a null pointer exception in descriptor.jar during descriptor reads. It works fine with metrics-lib built with source and target 1.6, which I tested via a symbolic link.
Thanks for trying the branch! I just rebased and pushed that branch to latest master, which also contains a recent metrics-lib reference. I'll test it on the Onionoo mirror for a few days before deploying on the main Onionoo instance. Until then, I'll leave this ticket open.
No problem. It looks like this branch and deployed Onionoo produce slightly different results when processing the same data set (recent 73h). I attach a sample (onionoo_k is this branch). I'll test some multiple archive imports on this branch.
In status:
The timestamp (?) after the country code is sometimes set to -1.
In out:
Some bandwidth documents have an extra value.
Importing multiple months: I know Onionoo can do it, because I tested it (and I'm testing it again on this branch), but should it be encouraged? The current memory load is rather high. If someone tries to import a year of archives at once, can the current heap usage be guaranteed not to cause a failure? Maybe this won't be that big a deal; just warn the operator to limit the number of months imported at a time until other tickets deal with the heap load. Something to add to the documentation?
Input validation: I saw that metrics-lib includes some packages for compressed file handling, so I tried importing from .xz instead of a plain tarball. Some validation of the input archives might be worthwhile; bad things happen to the log when this is attempted.
Parsing archives: The parse history doesn't include archives, and archives aren't removed after parsing. DescriptorDownloader currently cannot remove the archives, because it only considers the recent folder.
Parsing archives: If --single-run or --update-only is used with archives that have already been parsed, they will be parsed again. This leads to a change in the size of the status folder: it becomes smaller for the same number of archive-sourced files. I didn't try to determine the reason for this change at the time; I intend to revisit this potential problem to see whether it happens again, and why. It would also be interesting to check whether the change happens when re-processing recent data (which may happen when restoring a backup).
No problem. It looks like this branch and deployed Onionoo produce slightly different results when processing the same data set (recent 73h). I attach a sample (onionoo_k is this branch). I'll test some multiple archive imports on this branch.
In status:
The timestamp (?) after the country code is sometimes set to -1.
I think this one is harmless. If you're curious, you can read more about this by reading the comment in NodeStatus starting with "This is a (possibly surprising) hack...".
In out:
Some bandwidth documents have an extra value.
This one should be harmless, too. This has to do with running the hourly updater at a later time and compressing bandwidth intervals lying farther in the past. We simply don't need the 15-minute precision anymore when we're outside of the 3-day graph interval. There would be similar compressions once we're outside the 1-week, 1-month, etc. interval.
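To make the compression idea concrete, here is a rough sketch, not Onionoo's actual implementation: representing the bandwidth history as a map from interval-start milliseconds to byte counts is an assumption made only for this illustration, as is the choice of 1-hour buckets.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class BandwidthCompressionSketch {

  private static final long ONE_HOUR = 60L * 60L * 1000L;
  private static final long THREE_DAYS = 3L * 24L * 60L * 60L * 1000L;

  /* History is assumed, for this sketch, to map interval-start
   * milliseconds to bytes transferred in that 15-minute interval. */
  public static SortedMap<Long, Long> compress(
      SortedMap<Long, Long> history, long now) {
    SortedMap<Long, Long> compressed = new TreeMap<>();
    for (Map.Entry<Long, Long> e : history.entrySet()) {
      long intervalStart = e.getKey();
      if (now - intervalStart <= THREE_DAYS) {
        /* Inside the 3-day graph interval: keep 15-minute precision. */
        compressed.put(intervalStart, e.getValue());
      } else {
        /* Older than 3 days: merge into 1-hour buckets by summing. */
        long bucketStart = intervalStart - (intervalStart % ONE_HOUR);
        compressed.merge(bucketStart, e.getValue(), Long::sum);
      }
    }
    return compressed;
  }
}
```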
Importing multiple months: I know Onionoo can do it, because I tested it (and I'm testing it again on this branch), but should it be encouraged? The current memory load is rather high. If someone tries to import a year of archives at once, can the current heap usage be guaranteed not to cause a failure? Maybe this won't be that big a deal; just warn the operator to limit the number of months imported at a time until other tickets deal with the heap load. Something to add to the documentation?
Yes, this is something we could add to the documentation. Unfortunately, reducing memory requirements enough to import multiple months or even years of descriptors is tough, because that's a very different use case from running the updater once per hour with only one hour of descriptors. When in doubt, I optimized the process in favor of the hourly update process. That's why I'd prefer to add a warning to the documentation.
Input validation: I saw that metrics-lib includes some packages for compressed file handling, so I tried importing from .xz instead of a plain tarball. Some validation of the input archives might be worthwhile; bad things happen to the log when this is attempted.
True! I just created #16424 (moved) for this, to support importing .xz-compressed tarballs. In general, Onionoo is not very robust against invalid input provided by the service operator, because so far the service operator has also been the main developer. But let's try to fix that and make it more robust, if we can.
Parsing archives: The parse history doesn't include archives, and archives aren't removed after parsing. DescriptorDownloader currently cannot remove the archives, because it only considers the recent folder.
Oh, I don't think Onionoo should remove tarballs from the archive directory after parsing them, because it didn't place them there beforehand. What we could do, however, is add a parse history for files in the archive directory; see the newly created #16426 (moved).
Parsing archives: If --single-run or --update-only is used with archives that have already been parsed, they will be parsed again. This leads to a change in the size of the status folder: it becomes smaller for the same number of archive-sourced files. I didn't try to determine the reason for this change at the time; I intend to revisit this potential problem to see whether it happens again, and why. It would also be interesting to check whether the change happens when re-processing recent data (which may happen when restoring a backup).
It would be interesting to learn more about that directory becoming smaller. For now, I'll assume it's related to the differences stated above. But if you spot an actual bug there, please mention it here or open a new ticket.
Thanks for trying this out and sending feedback here!
Thank you for clearing up the slight differences mentioned. I was hoping those were minor. There were other differences, but they were clearly trivial (like the omission of rdns, or the use of the IP address for unresolved rdns). I'll take another look at the code in NodeStatus.
Input validation: Excellent, I was thinking this too! If extra validation is going to be performed, it's also worth looking into streaming data from the archives directly. I suspect this would be a significant advantage, as it would no longer be necessary to use extra disk space for the uncompressed tarball.
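As an illustration of that streaming idea, here is a minimal sketch using Apache Commons Compress (with the XZ for Java dependency) to read descriptor entries straight out of a .tar.xz file without unpacking it to disk. This is not what Onionoo or metrics-lib currently do; handing each entry's bytes to a parser is left as a placeholder.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public class StreamingArchiveReadSketch {

  public static void readTarXz(String path) throws IOException {
    try (InputStream file = new BufferedInputStream(
            new FileInputStream(path));
        InputStream xz = new XZCompressorInputStream(file);
        TarArchiveInputStream tar = new TarArchiveInputStream(xz)) {
      TarArchiveEntry entry;
      while ((entry = tar.getNextTarEntry()) != null) {
        if (!entry.isFile()) {
          continue;
        }
        /* Read one entry's bytes; a real implementation would hand
         * these to the descriptor parser instead of counting them. */
        byte[] rawDescriptorBytes = IOUtils.toByteArray(tar);
        System.out.println(entry.getName() + ": "
            + rawDescriptorBytes.length + " bytes");
      }
    }
  }
}
```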
Parsing archives: Sounds good. I was thinking of at least warning the operator about an accumulation of archives, but with #16424 (moved) this isn't as much of a problem.
Importing multiple months: I was testing this while also looking into reproducing the smaller directory for parsed data. I got an out-of-memory heap error while using --update-only with two months of archives. It occurred at approximately 80% (based on time), during consensus parsing (based on the stack trace), so the parsing itself is quite sensitive to heap memory. I have some thoughts on how to solve this. Besides disk-based data structures to reduce heap dependency, I'll take another look at metrics-lib to see whether it could benefit from lexer/parser improvements. The heap dependency during parsing could be reduced, while also improving maintainability, by using a grammar-based recognizer, streaming reads (from archives), and lock-free (CAS) lists. Done right, that would create a parse stage that scales with I/O and combines parsing and writing, further reducing heap requirements.
Parsing archives: Due to the out-of-memory error I restarted this test with a smaller data set. I also hope it's harmless, but having seen it I don't want to rule it out unless it's provably so. I'll let you know here once I know for sure.
Sorry for the delay. I've now checked re-parsing identical data with the following results.
Reprocessing data rewrites the timestamp of last_changed_or_address_or_port from a valid value to -1.
The host_name key gets removed. I did see this before. I believe this depends on the rdns resolving component failing. I use a system tor for dns resolution so this makes sense (and sounds mostly harmless).
That's it, besides the artifacts mentioned previously. This all combines to produce smaller data stores after reprocessing. If rewriting last_changed_or_address_or_port sounds harmless too then I think this ticket can be closed now that the branch was merged. The rest can be handled in their own tickets.
Unless, would you prefer to deal with the command-line parsing before closing?
Sorry for the delay. I've now checked re-parsing identical data with the following results.
Reprocessing data rewrites the timestamp of last_changed_or_address_or_port from a valid value to -1.
Interesting. I only observed it the other way around, which is not good either. There is a bug here that we need to fix. I'm listing it below, so that we don't forget about it.
The host_name key gets removed. I did see this before. I believe this depends on the rdns resolving component failing. I use a system tor for dns resolution so this makes sense (and sounds mostly harmless).
Makes sense. This is not a bug, I think.
That's it, besides the artifacts mentioned previously. This all combines to produce smaller data stores after reprocessing. If rewriting last_changed_or_address_or_port sounds harmless too then I think this ticket can be closed now that the branch was merged. The rest can be handled in their own tickets.
Not harmless, I'm afraid. See the list below.
Unless, would you prefer to deal with the command-line parsing before closing?
That can happen in its own ticket.
So, I finally reproduced some of these issues and discovered quite a few more. I'm listing all issues here, so that we can either fix them in this ticket or open new tickets for some or all of them. I assume some of these are closely related, which is why I didn't open new tickets just yet.
status/summary:
-noah r mer &lt;memeticpox at gmail dot com&gt;
+noah r?mer &lt;memeticpox at gmail dot com&gt;
-0x2b3fc09b375594c0 sebastian m ki &lt;sebastian@tico.fi&gt; - 1j7fbivh6kf8ujsgp23fej4knms3x5px1v
+0x2b3fc09b375594c0 sebastian m?ki &lt;sebastian@tico.fi&gt; - 1j7fbivh6kf8ujsgp23fej4knms3x5px1v