Could lazy reading/providing of bytes leave the code referred to in #22140 (moved) and #22141 (moved) as is?
Maybe worth a try: the descriptor would only store the file location and position and would provide raw bytes on request by accessing the file system, rather than keeping them in memory?
Hmm, can you explain that plan a bit more? How would we implement DescriptorFile#getDescriptors() and DescriptorParser#parseDescriptors() that both return a List<Descriptor> without having read the bytes from the file system?
And when would the reading and parsing happen and by whom? Would that not be done by the background parsing thread (or in the future: threads) anymore?
DescriptorFile#getDescriptors() and DescriptorParser#parseDescriptors() don't access files. They receive Descriptor objects or bytes and will have to keep the bytes, but these methods don't cause an OOM unless their caller provides too much.
The problem lies in the implementation of DescriptorReaderImpl$DescriptorReaderRunnable (which, as an aside, should be a separate class). There the readFile method attempts to read an entire file and chokes when encountering a huge file. DescriptorReaderRunnable should check the file size before opening a file in order to handle files according to their size; a sketch follows below. The OOM is caused by reading the entire file into memory and then operating on it in memory, creating all the Descriptor objects (possibly copying the raw bytes; I didn't verify) in memory. Memory usage could be reduced

- by only reading parts of the huge file, and
- by not adding the bytes to the descriptor objects, instead keeping only the file path and position inside the file in memory.
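To illustrate, here is a minimal sketch of such a size check; the class name and threshold are made up, and nothing in it is existing metrics-lib code:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class SizeAwareReader {
  /* Hypothetical cutoff; the actual value would need tuning. */
  private static final long MAX_IN_MEMORY_BYTES = 16L * 1024 * 1024;

  void readFile(File file) throws IOException {
    if (file.length() <= MAX_IN_MEMORY_BYTES) {
      /* Small file: read it fully into memory, as the current code does. */
      byte[] rawBytes = Files.readAllBytes(file.toPath());
      parseInMemory(rawBytes);
    } else {
      /* Huge file: read and parse it chunk by chunk, keeping only file
       * path and in-file positions per descriptor. */
      parseInChunks(file);
    }
  }

  void parseInMemory(byte[] rawBytes) {
    /* Existing parsing path. */
  }

  void parseInChunks(File file) {
    /* New parsing path for huge files. */
  }
}
```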
Assumptions:

1. Many Descriptor objects without bytes occupy far less space than Descriptor objects do currently.
2. The descriptor-containing files remain available as long as there are Descriptor objects referring to them.
A sketch of changes (see the code sketch after this list):

1. Introduce descriptors that either hold their bytes in memory or have a file path and in-file position(s) for accessing the raw bytes, but don't store the bytes.
2. DescriptorImpl parses bytes and produces a list of the adapted Descriptor objects.
3. DescriptorReaderRunnable needs to read a certain chunk of a large file, parse enough to determine the next descriptor, and also provide the parser with the beginning and end positions in the file.

This stays very close to the current implementation; the details need some more work, and it might be necessary to change more.
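To make the first point more concrete, here is a rough sketch of the two byte-storage variants; RawBytesSource, InMemoryBytes, and FileBackedBytes are invented names for illustration, not existing metrics-lib types:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Raw descriptor bytes that are either held in memory or read lazily. */
interface RawBytesSource {
  byte[] getRawBytes() throws IOException;
}

/** Variant 1: bytes kept in memory, as in the current implementation. */
class InMemoryBytes implements RawBytesSource {
  private final byte[] bytes;

  InMemoryBytes(byte[] bytes) {
    this.bytes = bytes;
  }

  @Override
  public byte[] getRawBytes() {
    return this.bytes;
  }
}

/** Variant 2: only the file path and in-file positions are kept. */
class FileBackedBytes implements RawBytesSource {
  private final File file;
  private final long offset;
  private final int length;

  FileBackedBytes(File file, long offset, int length) {
    this.file = file;
    this.offset = offset;
    this.length = length;
  }

  @Override
  public byte[] getRawBytes() throws IOException {
    /* Relies on assumption 2 above: the file must still exist,
     * unchanged, when the bytes are requested. */
    try (RandomAccessFile raf = new RandomAccessFile(this.file, "r")) {
      byte[] result = new byte[this.length];
      raf.seek(this.offset);
      raf.readFully(result);
      return result;
    }
  }
}
```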
Great ideas above! And I think we should implement them, because they clearly reduce memory consumption.
Going into more details, your second assumption makes sense to me. I didn't think of that before, but I agree that we can make that assumption.
However, your first assumption is unfortunately wrong. I just concatenated all votes from May 1, 2017 into a single file with a size of 0.8G. I passed that to metrics-lib to read and parse it, which consumed 4.1G of memory in total for parsed descriptors and the contained raw descriptor bytes. I then modified DescriptorImpl to avoid storing raw descriptor bytes in memory, which led to memory consumption of 3.3G. The difference is precisely the 0.8G of the original file. But the 3.3G still remain, and that number will grow with the number of descriptors we put in a file. Like, 72 hours of votes from an initial CollecTor sync with all votes concatenated into a single file would consume 9.9G, plus a few G more while parsing. So no, the suggestion above is certainly an improvement, but it still does not scale.
But I could see us making these suggested improvements anyway, and they'll help us going forward. Some thoughts:
We could modify Descriptor#getRawDescriptorBytes() to use its file reference and start and end positions to retrieve the bytes from disk and return them to the caller, rather than requiring the user to do that themselves. This would even make the change backward-compatible.
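A minimal sketch of that backward-compatible accessor, reusing the hypothetical FileBackedBytes holder from the sketch above; the caller keeps getting a byte[], but the bytes are only fetched from disk on request:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

abstract class LazyDescriptor {
  private final FileBackedBytes rawBytesSource; /* see sketch above */

  LazyDescriptor(FileBackedBytes rawBytesSource) {
    this.rawBytesSource = rawBytesSource;
  }

  /* Same signature as today, so existing callers keep working. */
  public byte[] getRawDescriptorBytes() {
    try {
      return this.rawBytesSource.getRawBytes();
    } catch (IOException e) {
      /* The file was moved or deleted, violating assumption 2. */
      throw new UncheckedIOException(e);
    }
  }
}
```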
We should avoid calling that new Descriptor#getRawDescriptorBytes() ourselves at all costs while parsing and instead pass the bytes around directly. I'm mentioning this explicitly because I found uses of that method where we could have passed these bytes around as parameters instead.
We need to be careful to write the reading-files-in-chunks logic in a way that detects descriptor starts and ends across chunk boundaries. Think of tiny descriptors like microdescriptors.
And we should avoid scanning chunks repeatedly when a descriptor covers many, many such chunks. Think of huge descriptors like votes.
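Here is one rough way the boundary-aware chunk scanning could look, assuming descriptor starts can be located by a known keyword; all names and sizes are made up. The carry-over keeps the last few bytes of each chunk so a keyword split across a chunk boundary is still found, and each byte is only read once, so a huge descriptor spanning many chunks is not rescanned:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class ChunkedScanner {
  private static final int CHUNK_SIZE = 1024 * 1024; /* made-up size */

  /** Returns the file offsets of all occurrences of startKeyword. */
  static List<Long> findDescriptorStarts(File file, byte[] startKeyword)
      throws IOException {
    List<Long> starts = new ArrayList<>();
    byte[] chunk = new byte[CHUNK_SIZE];
    byte[] carry = new byte[0];
    long basePosition = 0L;
    try (InputStream in = new BufferedInputStream(
        new FileInputStream(file))) {
      int read;
      while ((read = in.read(chunk)) > 0) {
        /* Scan carry-over plus current chunk so that keywords spanning
         * a chunk boundary are not missed. */
        byte[] window = new byte[carry.length + read];
        System.arraycopy(carry, 0, window, 0, carry.length);
        System.arraycopy(chunk, 0, window, carry.length, read);
        for (int i = 0; i + startKeyword.length <= window.length; i++) {
          boolean match = true;
          for (int j = 0; j < startKeyword.length; j++) {
            if (window[i + j] != startKeyword[j]) {
              match = false;
              break;
            }
          }
          if (match) {
            starts.add(basePosition - carry.length + i);
          }
        }
        basePosition += read;
        /* Keep the last keyword-length - 1 bytes; a match cannot lie
         * entirely inside the carry, so nothing is reported twice. */
        int carryLength = Math.min(window.length, startKeyword.length - 1);
        carry = new byte[carryLength];
        System.arraycopy(window, window.length - carryLength, carry, 0,
            carryLength);
      }
    }
    return starts;
  }
}
```

For microdescriptors one would pass something like "onion-key".getBytes() as the keyword, for votes "network-status-version".getBytes(); picking robust start markers per descriptor type is part of the detail work.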
Once we're there, let's talk more about how to avoid keeping potentially huge lists of parsed descriptors in memory.
Do you want to start hacking on your suggestions above?
I started making some improvements here. Here's my train of thought:
1. Rather than reading the whole file into memory at the beginning, we could read it in chunks and start parsing as soon as we have seen a full descriptor. This sounds like a useful improvement, but it's actually very limited, at least on its own. Reading the 70M descriptor file I used for testing is done really fast; it's the parsing that takes long. As long as we need the full descriptor file contents in memory, we don't have to think about reading files in chunks. (See also 3. below.)
2. Rather than parsing all descriptors contained in a given file into a list and then taking all parsed descriptors and throwing them into the BlockingIterator<Descriptor>, we could just skip the list in the middle. The effect is that the time to the first descriptor is reduced enormously, whereas the time to the last descriptor stays the same. I prepared a patch for this; the commit message contains more details. (A before/after sketch follows after this list.)
3. Rather than storing descriptor file contents in a byte[], we could go through the file, read descriptor by descriptor, and store a File reference together with offset and length into the file. The effect would be that we avoid keeping the raw descriptor file contents in memory at all. We'd still keep parsed contents in memory. A possible downside is that the file must not be deleted or moved away while the application processes descriptors, which should be safe to require. Still, this is a larger change than 2., and it requires 1. That's why I postponed this.
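A minimal before/after sketch of item 2, with invented stand-in types; the actual patch works against metrics-lib's BlockingIterator<Descriptor>, but the shape of the change is the same:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;

class StreamingHandover {
  /* Stand-in for a parsed descriptor. */
  static class ParsedDescriptor { }

  /* Before: collect everything, then hand over the complete list; the
   * consumer sees nothing until the whole file is parsed. */
  static void parseThenHandOver(Iterator<ParsedDescriptor> parser,
      BlockingQueue<ParsedDescriptor> out) throws InterruptedException {
    List<ParsedDescriptor> parsed = new ArrayList<>();
    while (parser.hasNext()) {
      parsed.add(parser.next());
    }
    for (ParsedDescriptor descriptor : parsed) {
      out.put(descriptor);
    }
  }

  /* After: hand over each descriptor as soon as it is parsed; time to
   * first descriptor shrinks, time to last descriptor stays the same. */
  static void parseAndStream(Iterator<ParsedDescriptor> parser,
      BlockingQueue<ParsedDescriptor> out) throws InterruptedException {
    while (parser.hasNext()) {
      out.put(parser.next());
    }
  }
}
```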
After thinking about this for a while and looking at the proposed branch, I'd say it's finally time for an API change.
The proposed changes pretty much result in a blocking queue that knows when no more items will be added.
Therefore, it might be very useful for clients using metrics-lib to have a BlockingQueue<Descriptor> descriptorQueue in DescriptorReader, where they can choose whether they want to call a blocking method or a timeout method. The new BlockingQueue could be based on ArrayBlockingQueue with the additional methods boolean outOfDescriptors() and boolean setOutOfDescriptors().
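A rough sketch of what such a queue could look like, assuming that delegating to an ArrayBlockingQueue is acceptable; the two extra method names follow the proposal, everything else (capacity, generic parameter) is invented for illustration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class DescriptorQueue<D> {
  /* Hypothetical capacity; bounding the queue throttles the producer. */
  private final BlockingQueue<D> queue = new ArrayBlockingQueue<>(1000);
  private volatile boolean outOfDescriptors = false;

  /* Producer side: called by the reading/parsing thread(s). */
  void add(D descriptor) throws InterruptedException {
    queue.put(descriptor);
  }

  /* Producer side: signals that no more descriptors will be added. */
  boolean setOutOfDescriptors() {
    this.outOfDescriptors = true;
    return true;
  }

  /* Consumer side: true once the producer is done and the queue is
   * drained. */
  boolean outOfDescriptors() {
    return this.outOfDescriptors && queue.isEmpty();
  }

  /* Consumer side: blocking retrieval. */
  D take() throws InterruptedException {
    return queue.take();
  }

  /* Consumer side: timeout-based retrieval. */
  D poll(long timeout, TimeUnit unit) throws InterruptedException {
    return queue.poll(timeout, unit);
  }
}
```

One wrinkle worth noting in this sketch: a consumer blocked in take() would never return once the producer is done, so consumers would have to check outOfDescriptors() and prefer the timeout variant.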
The queue would also allow more than one consumer to access the descriptors made available, instead of iterating over or waiting for a list of descriptors. Adding elements in parallel would also become easier.
We could provide the new method BlockingQueue<Descriptor> descriptorQueue while deprecating the old Iterable<Descriptor> readDescriptors before removing the old method, thus paving the way to faster, more efficient processing on both the client's and metrics-lib's side.
Hmm. I see the benefits of such an API change. And we already have #22678 (moved) that describes something very close, if not the same.
But should we separate the API change from this fix? In a way, I consider it a bug that we're parsing an entire descriptor file before passing all descriptors to the application. It's something we should have changed when taking out DescriptorFile.
How about I rewrite my patch to not make any changes to the API (in particular DescriptorParser), and then we discuss the API change, including how to do it in a backward-compatible fashion, as part of #22678 (moved)?
> Hmm. I see the benefits of such an API change. And we already have #22678 (moved) that describes something very close, if not the same.

Yes!

> But should we separate the API change from this fix? In a way, I consider it a bug that we're parsing an entire descriptor file before passing all descriptors to the application. It's something we should have changed when taking out DescriptorFile.

> How about I rewrite my patch to not make any changes to the API (in particular DescriptorParser), and then we discuss the API change, including how to do it in a backward-compatible fashion, as part of #22678 (moved)?
I think all effort is better spent renewing the API. You might call the issue at hand a 'bug', but it really is not. The issues we're dealing with are historically grown design decisions that should be modernized now, not patched and stitched to last longer. There is no way around the API change, so the sooner the better.
I think the two issues are unrelated. One is about adding parsed descriptors to a data structure, the other is about using different code for the data structure.
Let's just try to fix this in the next couple of days. If that includes a backward-compatible API change, that's okay. Want to grab this?