Opened 4 years ago

Last modified 10 months ago

#13600 assigned enhancement

Improve bulk imports of descriptor archives

Reported by: karsten
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Onionoo
Version:
Severity: Normal
Keywords: metrics-2018
Cc: iwakeh
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

We need to improve bulk imports of descriptor archives. Whenever somebody wants to initialize Onionoo with existing data, they'll need to process years of descriptors. The current code is not optimized for that at all; it is designed to run once per hour and update things as quickly as possible. Let's fix that and support bulk imports better.

Here's what we should do:

  • We define a new directory in/archive/ where operators can put descriptor archives fetched from CollecTor. Whenever there are files in that directory, we import them first (before descriptors in in/recent/). In particular, we iterate over the files twice: in the first iteration we look at the first contained descriptor to determine its type, and in the second iteration we parse files containing server descriptors and then files containing other descriptors. (This order is important for computing advertised bandwidth fractions, which only works if we parse server descriptors before consensuses; see the sketch after this list.) This process will take a very long time, so we should log whenever we complete a tarball, and ideally we'd print out how many tarballs we have already parsed and how many more we still need to parse.
  • We add a new command-line switch --update-only for only updating status files and not downloading descriptors or writing document files. Operators could then import archives, which would take days or even weeks, and afterwards switch to downloading and processing recent descriptors. My branch task-12651-2 is a major improvement here, because it ensures that all documents will be written once the bulk import is done, not just the ones for relays and bridges that were contained in recent descriptors. Future command-line options would be --download-only and --write-only for the other two phases, plus --single-run, which would do what the current default does but just once; that becomes relevant once we switch from being called by cron every hour to scheduling our own hourly runs internally.
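A minimal sketch of the two-pass ordering described above (an illustration only, not Onionoo's actual code; containsServerDescriptors() is a hypothetical helper standing in for peeking at the first descriptor of each tarball):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ArchiveImportSketch {

  /* Hypothetical helper: the real implementation would peek at the first
   * descriptor contained in the tarball; checking the file name here is
   * only a stand-in for that. */
  static boolean containsServerDescriptors(File tarball) {
    return tarball.getName().startsWith("server-descriptors");
  }

  public static void main(String[] args) {
    File[] tarballs = new File("in/archive").listFiles();
    if (tarballs == null) {
      return;
    }

    /* First pass: determine each tarball's descriptor type. */
    List<File> serverDescriptorTarballs = new ArrayList<File>();
    List<File> otherTarballs = new ArrayList<File>();
    for (File tarball : tarballs) {
      if (containsServerDescriptors(tarball)) {
        serverDescriptorTarballs.add(tarball);
      } else {
        otherTarballs.add(tarball);
      }
    }

    /* Second pass: parse server descriptors first, then everything else,
     * and log progress after each completed tarball. */
    List<File> ordered = new ArrayList<File>(serverDescriptorTarballs);
    ordered.addAll(otherTarballs);
    int done = 0;
    for (File tarball : ordered) {
      /* parseTarball(tarball); -- actual parsing omitted in this sketch */
      done++;
      System.out.printf("Parsed %d of %d tarballs (%s).%n",
          done, ordered.size(), tarball.getName());
    }
  }
}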

I somewhat expect us to run into memory problems when importing months or even years of data at once. So, part of the challenge here will be to keep an eye on memory usage and fix any memory issues.

Child Tickets

Attachments (3)

Main.java (3.1 KB) - added by iwakeh 4 years ago.
example based on task-13089
out_sample (13.0 KB) - added by leeroy 3 years ago.
status_sample (6.1 KB) - added by leeroy 3 years ago.


Change History (24)

comment:1 Changed 4 years ago by iwakeh

To make the command-line arguments easier to handle, it might be helpful to use 'commons-configuration' (1.10 is available in wheezy). The documentation is in the commons-configuration user guide 1.10.

It would be nice to have a configuration file in addition, which sort of documents the execution, and, of course, to log the given configuration. A configuration file would also avoid longish command lines.
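A minimal sketch of what that could look like with commons-configuration 1.10 (the file name and property keys are made up for this illustration):

import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;

public class ConfigSketch {
  public static void main(String[] args) throws ConfigurationException {
    /* Load a properties file that documents the run; the keys are
     * invented for this example. */
    PropertiesConfiguration config =
        new PropertiesConfiguration("onionoo.properties");
    boolean updateOnly = config.getBoolean("updateOnly", false);
    String archiveDir = config.getString("archiveDir", "in/archive");

    /* Log the given configuration so every run is documented. */
    System.out.println("updateOnly=" + updateOnly
        + ", archiveDir=" + archiveDir);
  }
}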

A different aspect from the concurrent-run task (#13003): I added an example of in-Java scheduling without crontab (I'm testing it currently; execution seems fine).

Maybe that functionality could be integrated as another option? It doesn't change a lot and is very useful where no crontab-like program is available due to permissions or the operating system.

(The attached file is just an example, so there are general imports and the formatting is sloppy ;-) )
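For reference, a minimal sketch of such in-Java hourly scheduling using a ScheduledExecutorService (runUpdater() is a placeholder for whatever one hourly run does, not an actual Onionoo method):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerSketch {

  /* Placeholder for the actual hourly update run. */
  static void runUpdater() {
    System.out.println("Running hourly update...");
  }

  public static void main(String[] args) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    /* Run immediately, then once per hour, without relying on cron. */
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() {
        runUpdater();
      }
    }, 0L, 1L, TimeUnit.HOURS);
  }
}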

Last edited 4 years ago by iwakeh (previous) (diff)

Changed 4 years ago by iwakeh

Attachment: Main.java added

example based on task-13089

comment:2 Changed 4 years ago by iwakeh

And, for the command-line configuration, commons-cli could be used (wheezy provides 1.2).
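A minimal sketch of parsing the switches from the description with commons-cli 1.2 (the option names follow the ticket description; this is an illustration, not Onionoo's actual parsing code):

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.PosixParser;

public class CliSketch {
  public static void main(String[] args) throws ParseException {
    Options options = new Options();
    options.addOption(null, "single-run", false,
        "Run all steps once, then exit.");
    options.addOption(null, "update-only", false,
        "Only update internal status files, then exit.");

    CommandLineParser parser = new PosixParser();
    CommandLine cmd = parser.parse(options, args);
    if (cmd.hasOption("update-only")) {
      System.out.println("Would only update status files.");
    }
  }
}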

comment:3 Changed 3 years ago by iwakeh

Having noticed the recent tor-dev discussion, I wanted to add here that my Onionoo mirror has been running fine and is ready to go public, except for the import of the legacy data. I am waiting for this bulk import issue ...

comment:4 Changed 3 years ago by karsten

Ah, great. Let me move this issue forward then. I implemented something today that's very close to the description above in branch task-13600 in my public repository. I did some local testing, but I'd recommend doing some more testing on your own and making a backup of your status/ and out/ directories before deploying that branch. Let me know if you're running into any issues there. Thanks!

comment:5 Changed 3 years ago by iwakeh

Great to have the import tool now!

I might get around to starting tests in the next few days.

Is there some kind of documentation about current features, prerequisites, and command line switches?
Thanks a lot!

comment:6 Changed 3 years ago by karsten

There's some documentation in one of the commit messages:

New modes are:

--single-run     Run steps 1--3 only for a single time, then exit.
--download-only  Only run step 1: download recent descriptors, then
                 exit.
--update-only    Only run step 2: update internal status files, then
                 exit.
--write-only     Only run step 3: write output document files, then
                 exit.

Default mode is:

[no argument]    Run steps 1--3 repeatedly once per hour.

But please also take a look at the code before running it. Thanks!

comment:7 Changed 3 years ago by leeroy

The metrics-lib in this branch isn't up to date. When I ran some tests today using submodule init and update, it was built with source and target 1.5. This causes a null pointer exception in descriptor.jar during descriptor reading. It works fine with metrics-lib built with source and target 1.6, which I tested via a symbolic link.

Last edited 3 years ago by leeroy (previous) (diff)

comment:8 Changed 3 years ago by karsten

Thanks for trying the branch! I just rebased that branch onto the latest master, which also contains a recent metrics-lib reference, and pushed it. I'll test it on the Onionoo mirror for a few days before deploying it on the main Onionoo instance. Until then, I'll leave this ticket open.

comment:9 Changed 3 years ago by leeroy

No problem. It looks like this branch and deployed Onionoo produce slightly different results when processing the same data set (recent 73h). I attach a sample (onionoo_k is this branch). I'll test some multiple archive imports on this branch.

In status:

  • The timestamp (?) after the country code is sometimes set to -1.

In out:

  • Some bandwidth documents have an extra value.

Importing multiple months: I know Onionoo can do it, because I tested it (and am testing it on this branch), but should it be encouraged? The current memory load is rather high. If someone tries to import a year of archives at once, can the current heap usage be guaranteed not to cause a failure? Maybe this won't be that big a deal; just warn the operator to limit the number of months imported at a time until other tickets deal with the heap load. Something to add to the documentation?

Input validation: I saw metrics-lib included some packages for compressed file handling so I tried importing from .xz instead of tarball. Some validation of the input archives might be worthwhile. Bad things will happen to the log when this is attempted.

Parsing archives: Parse history doesn't include archives, and archives aren't removed after parsing. DescriptorDownloader cannot now remove the archives (current behavior) because it only considers the recent folder.

Parsing archives: If --single-run or --update-only is used with archives that have already been parsed, they will be parsed again. This leads to a change in the size of the status folder. It becomes smaller for the same number of archive-sourced files. I didn't try to determine the reason for this change at the time. I intend to revisit this potential problem to see if the same thing happens, and why. It might be interesting if the change also happens during re-processing of recent data (which may happen when restoring a backup of data).

Last edited 3 years ago by leeroy (previous) (diff)

Changed 3 years ago by leeroy

Attachment: out_sample added

Changed 3 years ago by leeroy

Attachment: status_sample added

comment:10 in reply to:  9 Changed 3 years ago by karsten

Replying to leeroy:

No problem. It looks like this branch and deployed Onionoo produce slightly different results when processing the same data set (recent 73h). I attach a sample (onionoo_k is this branch). I'll test some multiple archive imports on this branch.

In status:

  • The timestamp (?) after the country code is sometimes set to -1.

I think this one is harmless. If you're curious, you can read more about it in the comment in NodeStatus starting with "This is a (possibly surprising) hack...".

In out:

  • Some bandwidth documents have an extra value.

This one should be harmless, too. This has to do with running the hourly updater at a later time and compressing bandwidth intervals lying farther in the past. We simply don't need the 15-minute precision anymore when we're outside of the 3-day graph interval. There would be similar compressions once we're outside the 1-week, 1-month, etc. interval.
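To illustrate that compression step, here is a sketch of collapsing fine-grained intervals into coarser buckets once they fall outside a graph interval (my own illustration, not Onionoo's actual code; the Interval class and the hour-sized bucket are assumptions):

import java.util.ArrayList;
import java.util.List;

public class CompressionSketch {

  /* Simplified bandwidth interval: [start, end) in milliseconds plus
   * total bytes in that interval. */
  static class Interval {
    long startMillis;
    long endMillis;
    long bytes;
    Interval(long startMillis, long endMillis, long bytes) {
      this.startMillis = startMillis;
      this.endMillis = endMillis;
      this.bytes = bytes;
    }
  }

  /* Merge adjacent intervals that end before the cut-off into buckets of
   * at most one hour; newer intervals keep their finer precision. The
   * history list is assumed to be in chronological order. */
  static List<Interval> compress(List<Interval> history, long cutoffMillis) {
    List<Interval> compressed = new ArrayList<Interval>();
    Interval bucket = null;
    for (Interval interval : history) {
      if (interval.endMillis > cutoffMillis) {
        compressed.add(interval);   /* keep full precision */
      } else if (bucket != null
          && interval.startMillis == bucket.endMillis
          && interval.endMillis - bucket.startMillis <= 60L * 60L * 1000L) {
        bucket.endMillis = interval.endMillis;   /* extend current bucket */
        bucket.bytes += interval.bytes;
      } else {
        bucket = new Interval(interval.startMillis, interval.endMillis,
            interval.bytes);
        compressed.add(bucket);
      }
    }
    return compressed;
  }
}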

Importing multiple months: I know Onionoo can do it, because I tested it (and am testing it on this branch), but should it be encouraged? The current memory load is rather high. If someone tries to import a year of archives at once, can the current heap usage be guaranteed not to cause a failure? Maybe this won't be that big a deal; just warn the operator to limit the number of months imported at a time until other tickets deal with the heap load. Something to add to the documentation?

Yes, this is something we could add to the documentation. Unfortunately, reducing memory requirements enough to import multiple months or even years of descriptors is tough, because that's a very different use case from running the updater once per hour with only one hour of descriptors. When in doubt, I optimized the process in favor of the hourly update process. That's why I'd prefer to add a warning to the documentation.

Input validation: I saw metrics-lib included some packages for compressed file handling so I tried importing from .xz instead of tarball. Some validation of the input archives might be worthwhile. Bad things will happen to the log when this is attempted.

True! I just created #16424 for this to support importing .xz-compressed tarballs. In general, Onionoo is not very robust against invalid input provided by the service operator, because so far that service operator person was also the main developer. But let's try to fix that and make it more robust, if we can.

Parsing archives: Parse history doesn't include archives, and archives aren't removed after parsing. DescriptorDownloader cannot now remove the archives (current behavior) because it only considers the recent folder.

Oh, I don't think Onionoo should remove tarballs from the archive directory after parsing them, because it didn't place them there beforehand. What we could do, however, is add a parse history for files in the archive directory; see the newly created #16426.
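A sketch of what such a parse history could look like (the file format and helper names here are assumptions for illustration, not what #16426 specifies): remember each imported archive by name and last-modified time, and skip archives that are already listed.

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class ParseHistorySketch {

  /* Assumed history file format: one "<lastModifiedMillis> <fileName>"
   * line per already-imported archive. */
  static Map<String, Long> readHistory(Path historyFile) throws IOException {
    Map<String, Long> history = new HashMap<String, Long>();
    if (Files.exists(historyFile)) {
      for (String line : Files.readAllLines(historyFile,
          StandardCharsets.UTF_8)) {
        String[] parts = line.split(" ", 2);
        history.put(parts[1], Long.parseLong(parts[0]));
      }
    }
    return history;
  }

  /* An archive counts as parsed if its name is listed and its
   * last-modified time has not changed since. */
  static boolean alreadyParsed(Map<String, Long> history, File archive) {
    Long lastModified = history.get(archive.getName());
    return lastModified != null
        && lastModified.longValue() == archive.lastModified();
  }
}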

Parsing archives: If --single-run or --update-only is used with archives that have already been parsed, they will be parsed again. This leads to a change in the size of the status folder. It becomes smaller for the same number of archive-sourced files. I didn't try to determine the reason for this change at the time. I intend to revisit this potential problem to see if the same thing happens, and why. It might be interesting if the change also happens during re-processing of recent data (which may happen when restoring a backup of data).

It would be interesting to learn more about that directory becoming smaller. For now, I'll assume it's related to the differences stated above. But if you spot an actual bug there, please mention it here or open a new ticket.

Thanks for trying this out and sending feedback here!

comment:11 Changed 3 years ago by leeroy

Thank you for clearing up the slight differences mentioned. I was hoping those were minor. There were other differences, but they were clearly trivial (like the omission of rdns, or use of the IP for unresolved rdns). I'll take another look at the code in NodeStatus.

Input validation: Excellent, I was thinking this too! If extra validation is going to be performed, it's also worth checking out streaming data from the archives directly. I suspect this would be a significant advantage, as it would no longer be necessary to take up extra space for the uncompressed tarball.
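A sketch of what streaming directly from a .tar.xz could look like, assuming Apache Commons Compress (with XZ for Java on the classpath) were used; this is an illustration, not what Onionoo or metrics-lib actually does:

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;

public class StreamingImportSketch {

  /* Stream descriptor files out of a .tar.xz without ever writing the
   * uncompressed tarball to disk. */
  static void streamTarball(String path) throws IOException {
    TarArchiveInputStream tarIn = new TarArchiveInputStream(
        new XZCompressorInputStream(
            new BufferedInputStream(new FileInputStream(path))));
    try {
      TarArchiveEntry entry;
      while ((entry = tarIn.getNextTarEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = tarIn.read(buffer)) != -1) {
          raw.write(buffer, 0, read);
        }
        /* raw.toByteArray() would then be handed to the descriptor
         * parser for validation and parsing. */
        System.out.println(entry.getName() + ": " + raw.size() + " bytes");
      }
    } finally {
      tarIn.close();
    }
  }
}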

Parsing archives: Sounds good. I was thinking of at least warning the operator about an accumulation of archives, but with #16424 this isn't as much of a problem.

Importing multiple months: I was testing this together with looking into reproducing the smaller directory for parsed data. I got the out-of-memory heap error while using --update-only with two months. It occurred at approx. 80% (based on time), during consensus parsing (based on the stack trace). So parsing itself is very sensitive to heap memory. I have some thoughts on how to solve this. Besides the disk-based data structures to reduce heap dependency, I'll take another look at metrics-lib to see if it can benefit from lexer-parser improvements. The heap dependency during parsing could be reduced, while increasing ease of maintenance, by using a grammar-based recognizer, streaming reads (from archives), and lock-free (CAS) lists. Done right, this creates a parse stage that scales with I/O and combines parsing and writing, reducing the heap requirement.

Parsing archives: Due to the out-of-memory error I restarted this test using a smaller data set. I also hope the issue is harmless, but having seen it I don't want to rule out a bug until I can prove otherwise. I'll notify you here once I know for sure.

comment:12 Changed 3 years ago by leeroy

Sorry for the delay. I've now checked re-parsing identical data with the following results.

  • Reprocessing data rewrites the timestamp of last_changed_or_address_or_port from a valid value to -1.
  • The host_name key gets removed. I did see this before. I believe this depends on the rdns resolving component failing. I use a system tor for dns resolution so this makes sense (and sounds mostly harmless).

That's it, besides the artifacts mentioned previously. This all combines to produce smaller data stores after reprocessing. If rewriting last_changed_or_address_or_port sounds harmless too then I think this ticket can be closed now that the branch was merged. The rest can be handled in their own tickets.

Unless you'd prefer to deal with the command-line parsing before closing?

Last edited 3 years ago by leeroy (previous) (diff)

comment:13 in reply to:  12 Changed 3 years ago by karsten

Replying to leeroy:

Sorry for the delay. I've now checked re-parsing identical data with the following results.

  • Reprocessing data rewrites the timestamp of last_changed_or_address_or_port from a valid value to -1.

Interesting. I only observed the other way around, which is not good either. There is a bug here that we need to fix; I'm listing it below so that we don't forget about it.

  • The host_name key gets removed. I did see this before. I believe this depends on the rdns resolving component failing. I use a system tor for dns resolution so this makes sense (and sounds mostly harmless).

Makes sense. This is not a bug, I think.

That's it, besides the artifacts mentioned previously. This all combines to produce smaller data stores after reprocessing. If rewriting last_changed_or_address_or_port sounds harmless too then I think this ticket can be closed now that the branch was merged. The rest can be handled in their own tickets.

Not harmless, I'm afraid. See the list below.

Unless you'd prefer to deal with the command-line parsing before closing?

That can happen in its own ticket.

So, I finally reproduced some of these issues and discovered quite a few more. I'm listing all issues here, so that we can either fix them in this ticket or open new tickets for some or all of them. I assume some of these are closely related, which is why I didn't open new tickets just yet.

1. Bandwidth statuses contain overlapping intervals

status/bandwidth/0/0/0011BD2485AD45D984EC4159C88FC066E5E3300E
 w 2015-07-04 20:56:24 2015-07-04 21:56:24 6862711808
 w 2015-07-04 21:56:24 2015-07-04 22:56:24 8487956480
+w 2015-07-04 22:11:24 2015-07-04 22:56:24 5816468480
 w 2015-07-04 22:56:24 2015-07-04 23:56:24 7400038400

 w 2015-07-05 15:56:24 2015-07-05 16:56:24 8741448704
-w 2015-07-05 16:11:24 2015-07-05 16:56:24 6629600256
 w 2015-07-05 16:56:24 2015-07-05 17:56:24 8642562048

 w 2015-07-05 20:56:24 2015-07-05 21:56:24 10966376448
-w 2015-07-05 21:56:24 2015-07-05 22:56:24 9769795584
+w 2015-07-05 21:56:24 2015-07-05 22:11:24 2572811264
 w 2015-07-05 22:56:24 2015-07-05 23:56:24 7191084032

 r 2015-07-04 21:56:24 2015-07-04 22:56:24 8327266304
+r 2015-07-04 22:11:24 2015-07-04 22:56:24 5703036928
 r 2015-07-04 22:56:24 2015-07-04 23:56:24 7253635072

2. Relay flags contain empty string

status/details/B/A/BA2067E5ACA2417EA3DF3D883CCD411DCE79A4E0
-"relay_flags":[]
+"relay_flags":[""]

3. Last changed OR address or port differs

status/details/F/D/FD2836E402083EAF1E40635EC6EBD4CF83126988
-"last_changed_or_address_or_port":-1
+"last_changed_or_address_or_port":1436405873000

status/details/F/C/FC75ECCBB64F3786B079D78F52F33E9A00529C2B
-"last_changed_or_address_or_port":1436349600000
+"last_changed_or_address_or_port":1435708800000

See also status/summary with node statuses.

4. Middle probability differs

status/details/F/F/FF89E8901B433F2546F3E705B44918CA1F33F541
-"middle_probability":7.0548704E-6
+"middle_probability":7.2019566E-6

5. Consensus weight differs

F4E92F76C532F407968C2FA5396AE4E30064D418
-"consensus_weight":0
+"consensus_weight":-1

6. Contact differs

status/summary
-noah r mer <memeticpox at gmail dot com>
+noah r?mer <memeticpox at gmail dot com>

-0x2b3fc09b375594c0 sebastian m ki <sebastian@tico.fi> - 1j7fbivh6kf8ujsgp23fej4knms3x5px1v
+0x2b3fc09b375594c0 sebastian m?ki <sebastian@tico.fi> - 1j7fbivh6kf8ujsgp23fej4knms3x5px1v

comment:14 Changed 3 years ago by leeroy

Would #16426 help here? If the archives aren't reprocessed, none of the changes can happen. (When processing recent data, UTF-8 tagged content gets replaced with ?. This happens even without reprocessing.)

The current history file mechanism does have one drawback: if an archive is removed and put back (some time later), it'll be treated as new data. So never put archives back; only remove them after importing.

Last edited 3 years ago by leeroy (previous) (diff)

comment:15 Changed 3 years ago by leeroy

I have tracked some of these bugs to NodeStatus and BandwidthStatus, in particular { 1, 2, 3, 5 } above. These bugs occur any time NodeStatus or BandwidthStatus is used on descriptors that already existed during StatusUpdateRunner. The changes appear not to occur during reprocessing; however, if you import a descriptor from archives and then update it somehow, the changes do occur.

Number 4 is a computed value which I've not yet attempted to isolate.

The replacement of strings with ? in number 6 appears to be due to improper handling of embedded UTF-8 content. This happens any time data is processed, regardless of reprocessing. Initially the files are written correctly (during reading), but by the end of StatusUpdateRunner they have been altered again.

(Also worth mentioning: Onionoo is currently serving data containing these artifacts, which wasn't the case a couple of months ago. The UTF-8 tagged content has been replaced with ?. The good news is that because this change is introduced during the processing of recent data, it should be self-correcting for recent data once patched. That still leaves older entries uncorrected, though. It also doesn't answer what should be done about the overlapping intervals in bandwidth statuses; maybe a maintenance option for Onionoo could correct those.)

Last edited 3 years ago by leeroy (previous) (diff)

comment:16 in reply to:  4 Changed 3 years ago by iwakeh

@karsten

This ticket has a long list of comments now. So, before drawing any wrong conclusions, I'd rather ask:

  • is the import tool's status OK so far for trying imports again? If so, which branch and repo should I use?
  • is any additional revision needed?
  • or is even more coding necessary?

Thanks!

comment:17 Changed 3 years ago by karsten

@iwakeh

I don't have a good answer for you, because I didn't have the chance to go through all comments on this ticket and the others yet. But if I were to re-import descriptor archives into a new Onionoo instance, I'd do the following:

  • Use latest master of the official repository, nothing else.
  • Decompress (but not extract) tarballs using unxz.
  • Start with importing a single tarball or all tarballs of a single month, then try with three months, then twelve, etc. You'll probably run into out-of-memory problems at some point, and you'll have to find out how many tarballs you can process at once. Keep in mind that tarballs got bigger and bigger over time.
  • Once an import run completes, move the tarballs away, because otherwise they will be re-imported.
  • Make backups of the status/ directory after each import run.

Sorry that this is not as convenient as it should be.

comment:18 Changed 3 years ago by leeroy

Keep in mind that unpatched master will introduce the update artifacts mentioned above. Just removing an archive after parsing it once is not a guaranteed fix: what if you import the latest month's archive and then import recent data? (You get the same problems.) That being said...

You can expect the numbers quoted for time and space to still be accurate. The implementation imports by archive anyway, so you might as well make your life easy and import by month. Master doesn't include any mitigation against losses on failure. If you want to redo the results (in particular the ones that try to trigger a memory error), go ahead, but I see no benefit. The implementation is heap-based; we know this already.

Speaking of loss mitigation: writing a history file, and writing it early (as opposed to at the end of the entire updater run, which may take days), would help. If this sounds useful, see #16426 (which requires #16540). This *shouldn't* be seen as a fix for the update artifacts mentioned above.

Besides that, you might also find #16612 useful (which requires #16424).

Be careful during testing with backups and your permissions; if incorrectly set, they can wreak havoc on your logs. See #16401 for that.

If you intend to test archive imports, your best bet is to craft the test archive yourself.

I hope that clears things up. Let me know if you need further details.

(Most important results from other tickets:)

  • 12+ hours of import time per month of archives
  • 4 GB+ RAM over time for the updater; if you restart the updater afterwards (i.e. periodic mode), you can easily run with < 1 GB RAM if your update interval is much shorter than 73 hours (73h requires ~2 GB)

Last edited 3 years ago by leeroy (previous) (diff)

comment:19 Changed 12 months ago by karsten

Keywords: metrics-2018 added

comment:20 Changed 12 months ago by karsten

Owner: set to metrics-team
Status: new → assigned

comment:21 Changed 10 months ago by teor

Severity: Normal

Set all open tickets without a severity to "Normal"

Note: See TracTickets for help on using tickets.