We need to improve bulk imports of descriptor archives. Whenever somebody wants to initialize Onionoo with existing data, they'll need to process years of descriptors. The current code is not at all optimized for that; rather, it's designed to run once per hour and update things as quickly as possible. Let's fix that and support bulk imports better.
Here's what we should do:
We define a new directory in/archive/ where operators can put descriptor archives fetched from CollecTor. Whenever there are files in that directory, we import them first (before descriptors in in/recent/). In particular, we iterate over files twice: in the first iteration we look at the first contained descriptor to determine the file's type, and in the second iteration we parse files containing server descriptors and then files containing other descriptors. (This order is important for computing advertised bandwidth fractions, which only works if we parse server descriptors before consensuses.) This process will take very long, so we should log whenever we complete a tarball, and ideally we'd print how many tarballs we have already parsed and how many are left; see the sketch after these steps.
We add a new command-line switch --update-only for only updating status files, without downloading descriptors or writing document files. Operators could then import archives, which could take days or even weeks, and afterwards switch to downloading and processing recent descriptors. My branch task-12651-2 is a major improvement here, because it ensures that all documents are written once the bulk import is done, not just those for relays and bridges contained in recent descriptors. Future command-line options would be --download-only and --write-only for the other two phases, plus --single-run, which does what the current default does but only once, for when we switch from being called by cron every hour to scheduling our own hourly runs internally.
I somewhat expect us to run into memory problems when importing months or even years of data at once. So, part of the challenge here will be to keep an eye on memory usage and fix any memory issues.
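To illustrate the ordering described in the first step above, here is a minimal Java sketch of the two-pass import. Nothing here is Onionoo's actual code: peekDescriptorType() and parseTarball() are hypothetical placeholders for whatever metrics-lib-based code ends up doing the real work, and the point is only the ordering and per-tarball progress logging.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ArchiveImportSketch {

  enum TarballType { SERVER_DESCRIPTORS, OTHER }

  public static void importArchives(File archiveDir) {
    File[] tarballs = archiveDir.listFiles();
    if (tarballs == null || tarballs.length == 0) {
      return;
    }

    /* First pass: decide which tarballs contain server descriptors. */
    List<File> serverDescriptorTarballs = new ArrayList<>();
    List<File> otherTarballs = new ArrayList<>();
    for (File tarball : tarballs) {
      if (peekDescriptorType(tarball) == TarballType.SERVER_DESCRIPTORS) {
        serverDescriptorTarballs.add(tarball);
      } else {
        otherTarballs.add(tarball);
      }
    }

    /* Second pass: parse server descriptors before consensuses etc.,
     * so that advertised bandwidth fractions can be computed. */
    List<File> ordered = new ArrayList<>(serverDescriptorTarballs);
    ordered.addAll(otherTarballs);
    int total = ordered.size();
    int done = 0;
    for (File tarball : ordered) {
      parseTarball(tarball);
      done++;
      System.out.printf("Parsed %s (%d of %d tarballs).%n",
          tarball.getName(), done, total);
    }
  }

  /* Hypothetical stand-in: a real implementation would look at the
   * first descriptor contained in the tarball, not at its name. */
  private static TarballType peekDescriptorType(File tarball) {
    return tarball.getName().contains("server-descriptors")
        ? TarballType.SERVER_DESCRIPTORS : TarballType.OTHER;
  }

  /* Hypothetical stand-in for handing the tarball to the parser. */
  private static void parseTarball(File tarball) {
    /* ... */
  }
}
```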
To make the command-line args easier to program, it might be helpful to use commons-configuration (1.10 is available in wheezy); see the commons-configuration 1.10 user guide for documentation. It would also be nice to have a configuration file in addition, which sort of documents the execution, and, of course, to log the given configuration. A configuration file would also avoid longish command lines.
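A minimal sketch of what that could look like with commons-configuration 1.10, combining a properties file with command-line overrides. The file name onionoo.properties and the key names are made up for this illustration; only the commons-configuration classes themselves (PropertiesConfiguration, BaseConfiguration, CompositeConfiguration) are real.

```java
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.CompositeConfiguration;
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;

public class ConfigSketch {

  public static void main(String[] args) throws ConfigurationException {
    /* Command-line switches like --update-only override the file. */
    BaseConfiguration commandLine = new BaseConfiguration();
    for (String arg : args) {
      if ("--update-only".equals(arg)) {
        commandLine.setProperty("mode", "update-only");
      } else if ("--single-run".equals(arg)) {
        commandLine.setProperty("mode", "single-run");
      }
    }

    /* A configuration file documents the execution and avoids long
     * command lines; the file name is made up for this sketch. */
    PropertiesConfiguration file =
        new PropertiesConfiguration("onionoo.properties");

    /* Configurations added first take precedence, so command-line
     * switches override values from the file. */
    CompositeConfiguration config = new CompositeConfiguration();
    config.addConfiguration(commandLine);
    config.addConfiguration(file);

    /* Log the effective configuration. */
    System.out.println("mode = " + config.getString("mode", "hourly"));
  }
}
```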
A different aspect from the concurrent-run task (#13003 (moved)): I added an example of in-Java scheduling without crontab (testing it currently; execution seems fine). Maybe that functionality could be integrated as another option? It doesn't change a lot and is very useful where no crontab-like program is available due to permissions or the operating system. (The attached file is just an example, so there are general imports and the formatting is sloppy. ;-)
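For reference, a minimal sketch of that kind of in-process scheduling using the standard library's ScheduledExecutorService. This only stands in for the attached example; runHourlyUpdate() is a hypothetical placeholder for Onionoo's actual hourly steps.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HourlySchedulerSketch {

  public static void main(String[] args) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    /* Run immediately, then once per hour, without relying on cron. */
    scheduler.scheduleAtFixedRate(
        HourlySchedulerSketch::runHourlyUpdate, 0, 1, TimeUnit.HOURS);
  }

  /* Hypothetical placeholder for the download/update/write steps. */
  private static void runHourlyUpdate() {
    System.out.println("Running hourly update at "
        + java.time.Instant.now());
  }
}
```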
Noticing the recent tor-dev discussion, I wanted to add here that my Onionoo mirror has been running fine and is ready to go public, except for the import of the legacy data. I am waiting for this bulk import issue ...
Ah, great. Let me move this issue forward then. I implemented something today that's very close to the description above in branch task-13600 in my public repository. I did some local testing, but I'd recommend doing some more testing on your own and making a backup of your status/ and out/ directories before deploying that branch. Let me know if you're running into any issues there. Thanks!
There's some documentation in one of the commit messages:
New modes are:
--single-run     Run steps 1-3 only for a single time, then exit.
--download-only  Only run step 1: download recent descriptors, then exit.
--update-only    Only run step 2: update internal status files, then exit.
--write-only     Only run step 3: write output document files, then exit.
Default mode is:
[no argument]    Run steps 1-3 repeatedly once per hour.
But please also take a look at the code before running it. Thanks!
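For readers skimming this ticket, a hypothetical sketch of how these modes map onto the three steps. This is not the code in the branch; the step methods are empty placeholders.

```java
public class ModeDispatchSketch {

  public static void run(String mode) throws InterruptedException {
    switch (mode) {
      case "--download-only":
        downloadRecentDescriptors();          /* step 1 only */
        break;
      case "--update-only":
        updateInternalStatusFiles();          /* step 2 only */
        break;
      case "--write-only":
        writeOutputDocumentFiles();           /* step 3 only */
        break;
      case "--single-run":
        runAllSteps();                        /* steps 1-3 once */
        break;
      default:
        while (true) {                        /* steps 1-3 hourly */
          runAllSteps();
          Thread.sleep(60L * 60L * 1000L);
        }
    }
  }

  private static void runAllSteps() {
    downloadRecentDescriptors();
    updateInternalStatusFiles();
    writeOutputDocumentFiles();
  }

  /* Placeholders for the actual steps. */
  private static void downloadRecentDescriptors() {}
  private static void updateInternalStatusFiles() {}
  private static void writeOutputDocumentFiles() {}
}
```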
The metrics-lib in this branch isn't up to date. When I ran some tests today using submodule init and update, it was built with source and target 1.5, which causes a null pointer exception in descriptor.jar during descriptor reads. It works fine with metrics-lib built with source and target 1.6, which I tested via a symbolic link.
Thanks for trying the branch! I just rebased and pushed that branch to latest master, which also contains a recent metrics-lib reference. I'll test it on the Onionoo mirror for a few days before deploying on the main Onionoo instance. Until then, I'll leave this ticket open.
No problem. It looks like this branch and deployed Onionoo produce slightly different results when processing the same data set (recent 73h). I attach a sample (onionoo_k is this branch). I'll test some multiple archive imports on this branch.
In status:
The timestamp (?) after the country code is sometimes set to -1.
In out:
Some bandwidth documents have an extra value.
Importing multiple months: I know Onionoo can do it, because I tested it (and I'm testing it again on this branch), but should it be encouraged? The current memory load is rather high. If someone tries to import a year of archives at once, can the current heap usage be guaranteed not to cause a failure? Maybe this won't be that big a deal; just warn the operator to limit the number of months imported at a time until other tickets deal with the heap load. Something to add to the documentation?
Input validation: I saw that metrics-lib includes some packages for compressed file handling, so I tried importing from .xz instead of a plain tarball. Some validation of the input archives might be worthwhile; bad things happen to the log when this is attempted.
Parsing archives: The parse history doesn't include archives, and archives aren't removed after parsing. DescriptorDownloader currently cannot remove the archives, because it only considers the recent folder.
Parsing archives: If --single-run or --update-only is used with archives that have already been parsed, they will be parsed again. This leads to a change in the size of the status folder: it becomes smaller for the same number of archive-sourced files. I didn't try to determine the reason for this change at the time; I intend to revisit this potential problem to see whether it happens again, and why. It would also be interesting to check whether the change happens when re-processing recent data (which may happen when restoring a backup).
No problem. It looks like this branch and deployed Onionoo produce slightly different results when processing the same data set (recent 73h). I attach a sample (onionoo_k is this branch). I'll test some multiple archive imports on this branch.
In status:
The timestamp (?) after the country code is sometimes set to -1.
I think this one is harmless. If you're curious, you can read more about this by reading the comment in NodeStatus starting with "This is a (possibly surprising) hack...".
In out:
Some bandwidth documents have an extra value.
This one should be harmless, too. This has to do with running the hourly updater at a later time and compressing bandwidth intervals lying farther in the past. We simply don't need the 15-minute precision anymore when we're outside of the 3-day graph interval. There would be similar compressions once we're outside the 1-week, 1-month, etc. interval.
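To make the compression idea concrete, here is a rough sketch, not Onionoo's actual implementation: representing the bandwidth history as a map from interval-start milliseconds to byte counts is an assumption made only for this illustration, as is the choice of 1-hour buckets.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class BandwidthCompressionSketch {

  private static final long ONE_HOUR = 60L * 60L * 1000L;
  private static final long THREE_DAYS = 3L * 24L * 60L * 60L * 1000L;

  /* History is assumed, for this sketch, to map interval-start
   * milliseconds to bytes transferred in that 15-minute interval. */
  public static SortedMap<Long, Long> compress(
      SortedMap<Long, Long> history, long now) {
    SortedMap<Long, Long> compressed = new TreeMap<>();
    for (Map.Entry<Long, Long> e : history.entrySet()) {
      long intervalStart = e.getKey();
      if (now - intervalStart <= THREE_DAYS) {
        /* Inside the 3-day graph interval: keep 15-minute precision. */
        compressed.put(intervalStart, e.getValue());
      } else {
        /* Older than 3 days: merge into 1-hour buckets by summing. */
        long bucketStart = intervalStart - (intervalStart % ONE_HOUR);
        compressed.merge(bucketStart, e.getValue(), Long::sum);
      }
    }
    return compressed;
  }
}
```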
Importing multiple months: I know Onionoo can do it, because I tested it (and I'm testing it again on this branch), but should it be encouraged? The current memory load is rather high. If someone tries to import a year of archives at once, can the current heap usage be guaranteed not to cause a failure? Maybe this won't be that big a deal; just warn the operator to limit the number of months imported at a time until other tickets deal with the heap load. Something to add to the documentation?
Yes, this is something we could add to the documentation. Unfortunately, reducing memory requirements enough to import multiple months or even years of descriptors is tough, because that's a very different use case from running the updater once per hour with only one hour of descriptors. When in doubt, I optimized the process in favor of the hourly update process. That's why I'd prefer to add a warning to the documentation.
Input validation: I saw that metrics-lib includes some packages for compressed file handling, so I tried importing from .xz instead of a plain tarball. Some validation of the input archives might be worthwhile; bad things happen to the log when this is attempted.
True! I just created #16424 (moved) for this, to support importing .xz-compressed tarballs. In general, Onionoo is not very robust against invalid input provided by the service operator, because so far the service operator has also been the main developer. But let's try to fix that and make it more robust, if we can.
Parsing archives: The parse history doesn't include archives, and archives aren't removed after parsing. DescriptorDownloader currently cannot remove the archives, because it only considers the recent folder.
Oh, I don't think Onionoo should remove tarballs from the archive directory after parsing them, because it didn't place them there beforehand. What we could do, however, is add a parse history for files in the archive directory; see the newly created #16426 (moved).
Parsing archives: If --single-run or --update-only is used with archives that have already been parsed, they will be parsed again. This leads to a change in the size of the status folder: it becomes smaller for the same number of archive-sourced files. I didn't try to determine the reason for this change at the time; I intend to revisit this potential problem to see whether it happens again, and why. It would also be interesting to check whether the change happens when re-processing recent data (which may happen when restoring a backup).
It would be interesting to learn more about that directory becoming smaller. For now, I'll assume it's related to the differences stated above. But if you spot an actual bug there, please mention it here or open a new ticket.
Thanks for trying this out and sending feedback here!
Thank you for clearing up the slight differences mentioned. I was hoping those were minor. There were other differences, but they were clearly trivial (like the omission of rdns, or the use of the IP address for unresolved rdns). I'll take another look at the code in NodeStatus.
Input validation: Excellent, I was thinking this too! If extra validation is going to be performed, it's also worth looking into streaming data from the archives directly. I suspect this would be a significant advantage, as it would no longer be necessary to use extra disk space for the uncompressed tarball.
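As an illustration of that streaming idea, here is a minimal sketch using Apache Commons Compress (with the XZ for Java dependency) to read descriptor entries straight out of a .tar.xz file without unpacking it to disk. This is not what Onionoo or metrics-lib currently do; handing each entry's bytes to a parser is left as a placeholder.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public class StreamingArchiveReadSketch {

  public static void readTarXz(String path) throws IOException {
    try (InputStream file = new BufferedInputStream(
            new FileInputStream(path));
        InputStream xz = new XZCompressorInputStream(file);
        TarArchiveInputStream tar = new TarArchiveInputStream(xz)) {
      TarArchiveEntry entry;
      while ((entry = tar.getNextTarEntry()) != null) {
        if (!entry.isFile()) {
          continue;
        }
        /* Read one entry's bytes; a real implementation would hand
         * these to the descriptor parser instead of counting them. */
        byte[] rawDescriptorBytes = IOUtils.toByteArray(tar);
        System.out.println(entry.getName() + ": "
            + rawDescriptorBytes.length + " bytes");
      }
    }
  }
}
```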
Parsing archives: Sounds good. I was thinking of at least warning the operator about an accumulation of archives, but with #16424 (moved) this isn't as much of a problem.
Importing multiple months: I was testing this while also looking into reproducing the smaller directory for parsed data. I got an out-of-memory heap error while using --update-only with two months of archives. It occurred at approximately 80% (based on time), during consensus parsing (based on the stack trace), so the parsing itself is quite sensitive to heap memory. I have some thoughts on how to solve this. Besides disk-based data structures to reduce heap dependency, I'll take another look at metrics-lib to see whether it could benefit from lexer/parser improvements. The heap dependency during parsing could be reduced, while also improving maintainability, by using a grammar-based recognizer, streaming reads (from archives), and lock-free (CAS) lists. Done right, that would create a parse stage that scales with I/O and combines parsing and writing, further reducing heap requirements.
Parsing archives: Due to the out-of-memory error I restarted this test with a smaller data set. I also hope it's harmless, but having seen it I don't want to rule it out unless it's provably so. I'll let you know here once I know for sure.
Sorry for the delay. I've now checked re-parsing identical data with the following results.
Reprocessing data rewrites the timestamp of last_changed_or_address_or_port from a valid value to -1.
The host_name key gets removed. I did see this before. I believe this depends on the rdns resolving component failing. I use a system tor for dns resolution so this makes sense (and sounds mostly harmless).
That's it, besides the artifacts mentioned previously. This all combines to produce smaller data stores after reprocessing. If rewriting last_changed_or_address_or_port sounds harmless too then I think this ticket can be closed now that the branch was merged. The rest can be handled in their own tickets.
Unless, would you prefer to deal with the command-line parsing before closing?
Sorry for the delay. I've now checked re-parsing identical data with the following results.
Reprocessing data rewrites the timestamp of last_changed_or_address_or_port from a valid value to -1.
Interesting. I only observed it the other way around, which is not good either. There is a bug here that we need to fix. I'm listing it below, so that we don't forget about it.
The host_name key gets removed. I did see this before. I believe this depends on the rdns resolving component failing. I use a system tor for dns resolution so this makes sense (and sounds mostly harmless).
Makes sense. This is not a bug, I think.
That's it, besides the artifacts mentioned previously. This all combines to produce smaller data stores after reprocessing. If rewriting last_changed_or_address_or_port sounds harmless too then I think this ticket can be closed now that the branch was merged. The rest can be handled in their own tickets.
Not harmless, I'm afraid. See the list below.
Unless, would you prefer to deal with the command-line parsing before closing?
That can happen in its own ticket.
So, I finally reproduced some of these issues and discovered quite a few more. I'm listing all issues here, so that we can either fix them in this ticket or open new tickets for some or all of them. I assume some of these are closely related, which is why I didn't open new tickets just yet.
status/summary:
-noah r mer &lt;memeticpox at gmail dot com&gt;
+noah r?mer &lt;memeticpox at gmail dot com&gt;
-0x2b3fc09b375594c0 sebastian m ki &lt;sebastian@tico.fi&gt; - 1j7fbivh6kf8ujsgp23fej4knms3x5px1v
+0x2b3fc09b375594c0 sebastian m?ki &lt;sebastian@tico.fi&gt; - 1j7fbivh6kf8ujsgp23fej4knms3x5px1v