Bridge descriptor sanitizer runs out of memory after 13.5 days
I'm currently reprocessing the bridge descriptor archive for #19317 (moved). The process, started with -Xmx6g on a machine with 8G RAM, ran out of memory after 13.5 days. I uploaded the custom log with additional debug lines for the currently processed tarball here: https://people.torproject.org/~karsten/volatile/collector-bridgedescs.log.xz (556K).
While writing tests for #19755 (moved), I noticed a possible explanation, though I don't have facts to prove it: BridgeSnapshotReader contains a Set<String> descriptorImportHistory that stores SHA-1 digests of files and single descriptors to skip duplicates as early as possible. If that set is never pruned, it would keep growing over the whole run. Its effect can be seen in log lines like this one, which comes from reprocessing 1 day of tarballs:
2016-07-28 11:54:31,206 DEBUG o.t.c.b.BridgeSnapshotReader:215 Finished
importing files in directory in/bridge-descriptors/. In total, we parsed
87 files (skipped 9) containing 24 statuses, 33984 server descriptors
(skipped 168368), and 29618 extra-info descriptors (skipped 50027).
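To illustrate the mechanism under suspicion, here is a minimal sketch of a digest-based duplicate check. It is not the actual BridgeSnapshotReader code: the class DigestHistorySketch and its methods are hypothetical, and only the idea of a Set<String> holding SHA-1 digests comes from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of skipping duplicates by SHA-1 digest. */
class DigestHistorySketch {

  /** Grows by one entry per distinct file or descriptor, never shrinks. */
  private final Set<String> descriptorImportHistory = new HashSet<>();

  /** Returns the lower-case hex SHA-1 digest of the given bytes. */
  static String sha1Hex(byte[] bytes) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(bytes)) {
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** Returns true if the content is new, false if it can be skipped. */
  boolean isNew(byte[] content) {
    return descriptorImportHistory.add(sha1Hex(content));
  }

  int size() {
    return descriptorImportHistory.size();
  }
}
```

With an unbounded set like this, every distinct digest stays in memory for the lifetime of the process, which is the behavior the theory below rests on.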
I don't know a good way to confirm this theory other than running the process once again for a few days and logging the size of that set. I also tried attaching jvisualvm last time, but for some reason that detached and froze after 90 hours.
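Logging the set size could look roughly like the following sketch, which formats the size together with current heap usage so both can be tracked over time in the debug log. The helper class ImportHistoryStats and the message format are assumptions for illustration. Only Runtime.totalMemory(), freeMemory(), and maxMemory() are standard JDK calls.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for logging the size of the import history. */
class ImportHistoryStats {

  /** Builds one log message with set size and heap usage in MiB. */
  static String describe(Set<String> descriptorImportHistory) {
    Runtime rt = Runtime.getRuntime();
    long mib = 1024L * 1024L;
    // Used heap is what the JVM has allocated minus what is still free.
    long usedMiB = (rt.totalMemory() - rt.freeMemory()) / mib;
    // maxMemory() reflects the limit set with -Xmx.
    long maxMiB = rt.maxMemory() / mib;
    return String.format(Locale.ROOT,
        "descriptorImportHistory size=%d, heap used=%d MiB of max %d MiB",
        descriptorImportHistory.size(), usedMiB, maxMiB);
  }
}
```

Emitting this message once per processed tarball would show whether the set size and the used heap grow together.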
Possible fixes:
- Use some kind of least-recently-used (or maybe least-recently-inserted if that's easier to implement) cache that allows us to skip duplicates in tarballs written on the same day or so. There's no harm in reprocessing a duplicate; it just takes more time than skipping it. Needs some testing to get the size right, though it seems from the log above that 100k entries might be enough.
- Avoid keeping a set and instead run the sanitizing process on each descriptor only until we know enough about it to check whether we wrote it before. That would mean computing the SHA-1 digest and parsing until reaching the publication time. In early tests this increased processing time by a factor of 1.2 to 1.3, and even more processing time is not exactly what I'm looking for.
- Are there other options, ideally ones that are easy to implement and maintain?
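For the first option, a minimal sketch of a size-bounded history could build on java.util.LinkedHashMap, which supports both eviction orders through its accessOrder constructor flag. The class name BoundedDigestHistory, the method firstTimeSeen, and any capacity passed in are hypothetical and not taken from CollecTor. The right capacity would still need the testing mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-bounded replacement for an unbounded digest set. */
class BoundedDigestHistory {

  private final Map<String, Boolean> digests;

  BoundedDigestHistory(final int maxEntries) {
    // accessOrder = true evicts the least-recently-used entry;
    // passing false instead would evict the least-recently-inserted one.
    this.digests = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        // Called after each insertion; drop the eldest entry once full.
        return size() > maxEntries;
      }
    };
  }

  /**
   * Records the digest and returns true if it was not in the history.
   * A false result means the descriptor can be skipped. An evicted
   * digest is reported as new again, so a duplicate may be reprocessed.
   */
  boolean firstTimeSeen(String digest) {
    return digests.put(digest, Boolean.TRUE) == null;
  }

  int size() {
    return digests.size();
  }
}
```

Usage would be along the lines of `new BoundedDigestHistory(100_000)` followed by a `firstTimeSeen(digest)` check before parsing. Memory then stays proportional to the capacity rather than to the length of the run, at the cost of occasionally reprocessing a duplicate whose digest was already evicted.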