Opened 3 years ago

Closed 2 years ago

#19778 closed defect (duplicate)

Bridge descriptor sanitizer runs out of memory after 13.5 days

Reported by: karsten
Owned by: iwakeh
Priority: High
Milestone: CollecTor 1.2.0
Component: Metrics/CollecTor
Version:
Severity: Normal
Keywords:
Cc: iwakeh
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

I'm currently reprocessing the bridge descriptor archive for #19317. The process, started with -Xmx6g on a machine with 8 GB of RAM, ran out of memory after 13.5 days. I uploaded the custom log with additional debug lines for the currently processed tarball here: https://people.torproject.org/~karsten/volatile/collector-bridgedescs.log.xz (556K).

While writing tests for #19755, I noticed a possible explanation, though I don't have the facts to prove it: BridgeSnapshotReader contains a Set<String> descriptorImportHistory that stores SHA-1 digests of files and single descriptors in order to skip duplicates as early as possible. Its effect can be seen in log lines like the following, which comes from reprocessing 1 day of tarballs:

2016-07-28 11:54:31,206 DEBUG o.t.c.b.BridgeSnapshotReader:215 Finished
importing files in directory in/bridge-descriptors/.  In total, we parsed
87 files (skipped 9) containing 24 statuses, 33984 server descriptors
(skipped 168368), and 29618 extra-info descriptors (skipped 50027).

I don't know a good way to confirm this theory other than running the process once again for a few days and logging the size of that set. I also tried attaching jvisualvm last time, but for some reason that detached and froze after 90 hours.
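For reference, the deduplication mechanism described above can be sketched as follows. This is not CollecTor's actual code; the class and method names are illustrative, but the SHA-1 hashing uses the standard java.security.MessageDigest API:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of a digest-based import history like the one in
 * BridgeSnapshotReader: an unbounded set of SHA-1 digests used to skip
 * files and descriptors that were seen before. The set grows without
 * limit, which matches the suspected cause of the memory exhaustion. */
public class ImportHistory {

  private final Set<String> descriptorImportHistory = new HashSet<>();

  /** Returns true if these bytes were seen before; records them otherwise. */
  public boolean isDuplicate(byte[] descriptorBytes) {
    /* Set#add returns false if the element was already present. */
    return !this.descriptorImportHistory.add(sha1Hex(descriptorBytes));
  }

  /** Computes the lower-case hex SHA-1 digest of the given bytes. */
  static String sha1Hex(byte[] bytes) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(bytes)) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-1 unavailable", e);
    }
  }
}
```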

Possible fixes:

  • Use some kind of least-recently-used (or maybe least-recently-inserted if that's easier to implement) cache that allows us to skip duplicates in tarballs written on the same day or so. There's no harm in reprocessing a duplicate, it just takes more time than skipping it. Needs some testing to get the size right, though it seems from the log above that 100k entries might be enough.
  • Avoid keeping a set and instead run the sanitizing process until we know enough about a descriptor to check whether we have written it before. That would mean computing the SHA-1 digest and parsing up to the publication time. In early tests this increased processing time by a factor of 1.2 or 1.3, and even more processing time is not exactly what I'm looking for.
  • Are there other options, ideally ones that are easy to implement and maintain?
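The first option could be sketched with a LinkedHashMap whose removeEldestEntry is overridden, which is the standard JDK way to build a small LRU cache. The class name and capacity below are illustrative; the ~100k figure is only the guess from the log above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch of a bounded digest cache for the first option
 * above. With accessOrder=true the map evicts the least-recently-used
 * entry; passing false instead would give least-recently-inserted
 * semantics, which may be easier to reason about. */
public class DigestCache {

  private static final Object PRESENT = new Object();

  private final Map<String, Object> cache;

  public DigestCache(final int maxEntries) {
    /* Third constructor argument switches LinkedHashMap to access order. */
    this.cache = new LinkedHashMap<String, Object>(maxEntries, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Object> eldest) {
        /* Evict the eldest entry once the capacity is exceeded. */
        return size() > maxEntries;
      }
    };
  }

  /** Returns true if the digest was already cached; records it otherwise. */
  public boolean seenBefore(String digest) {
    return this.cache.put(digest, PRESENT) != null;
  }
}
```

Reprocessing a duplicate that fell out of the cache would still be correct, just slower, so the capacity only needs tuning for speed, not correctness.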

Child Tickets

Attachments (1)

task-19778-commits-and-config.tar (36.0 KB) - added by karsten 3 years ago.


Change History (8)

comment:1 Changed 3 years ago by iwakeh

It might be easy to infer, but could you add the commit id of the CollecTor version you were using and a brief list of settings changed from the default config file of that commit?

Thanks!

Changed 3 years ago by karsten

comment:2 Changed 3 years ago by karsten

Certainly. I just attached a tiny tarball with three commits based on 050a88ffcf2b205a63741d4848951ce91c0bd02f and the collector.properties file. Thanks for looking!

comment:3 Changed 3 years ago by iwakeh

Owner: set to iwakeh
Status: new → assigned

comment:4 Changed 3 years ago by iwakeh

Milestone: CollecTor 1.1.0 → CollecTor 1.2.0

comment:5 Changed 3 years ago by karsten

This issue is closely related to #20236, though I couldn't decide which one to close as duplicate.

Oh, and regarding the LRU cache, maybe we can use LinkedHashMap for this as described here.

comment:6 in reply to:  5 Changed 3 years ago by iwakeh

Replying to karsten:

This issue is closely related to #20236, though I couldn't decide which one to close as duplicate.

I would opt for closing this one (with a note in #20236 to pay attention to the findings here), because #20236 has a branch attached.

Oh, and regarding the LRU cache, maybe we can use LinkedHashMap for this as described here.

comment:7 Changed 2 years ago by iwakeh

Resolution: duplicate
Status: assigned → closed

Closing as duplicate of #20236. Added a note there.
