Opened 2 years ago

Last modified 7 months ago

#20395 assigned defect

Add capability to handle large descriptor files

Reported by: iwakeh Owned by: metrics-team
Priority: Medium Milestone:
Component: Metrics/Library Version:
Severity: Normal Keywords:
Cc: metrics-team Actual Points:
Parent ID: Points:
Reviewer: Sponsor:

Description

Discovered OOM problem for large descriptor files in #20335, see comment 12 there for how to reproduce.

Child Tickets

Change History (24)

comment:1 Changed 2 years ago by iwakeh

One remedy for the out-of-memory would be using FileChannels for reading descriptor files.
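A minimal sketch of what FileChannel-based chunked reading could look like; the class name, chunk size, and callback shape are illustrative, not metrics-lib API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.Consumer;

/** Sketch: read a (potentially huge) descriptor file in fixed-size
 *  chunks via FileChannel instead of loading it whole into memory. */
public class ChunkedReader {

  static final int CHUNK_SIZE = 8 * 1024 * 1024; // 8 MiB per read

  /** Invokes the consumer once per chunk; at most one chunk is held
   *  in memory at any time. */
  static void readInChunks(Path file, Consumer<byte[]> consumer)
      throws IOException {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      ByteBuffer buffer = ByteBuffer.allocate(CHUNK_SIZE);
      while (channel.read(buffer) > 0) {
        buffer.flip();
        byte[] chunk = new byte[buffer.remaining()];
        buffer.get(chunk);
        consumer.accept(chunk);
        buffer.clear();
      }
    }
  }
}
```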

comment:2 Changed 2 years ago by iwakeh

Milestone: metrics-lib 1.5.0

comment:3 Changed 2 years ago by iwakeh

Milestone: metrics-lib 1.5.0 → metrics-lib 2.0.0

comment:4 Changed 20 months ago by karsten

I just opened #22140 and #22141, and I believe we need to resolve at least the latter before being able to resolve this one.

comment:5 Changed 19 months ago by iwakeh

Could lazy reading/providing of bytes leave the code referred to in #22140 and #22141 as is?
Maybe worth a try: the descriptor would only store the file location and position, and would provide raw bytes only on request, by accessing the file system rather than memory.

comment:6 Changed 19 months ago by karsten

Hmm, can you explain that plan a bit more? How would we implement DescriptorFile#getDescriptors() and DescriptorParser#parseDescriptors() that both return a List<Descriptor> without having read the bytes from the file system?

And when would the reading and parsing happen and by whom? Would that not be done by the background parsing thread (or in the future: threads) anymore?

Last edited 19 months ago by karsten

comment:7 Changed 19 months ago by iwakeh

I hope I didn't overlook anything:

DescriptorFile#getDescriptors() and DescriptorParser#parseDescriptors() don't access files. They receive Descriptor objects or bytes and will have to keep the bytes, but these methods don't cause an OOM unless their caller provides too much.

The problem lies in the implementation of DescriptorReaderImpl$DescriptorReaderRunnable (which, as an aside, should be a separate class). There the readFile method attempts to read an entire file and chokes when encountering a huge one. DescriptorReaderRunnable should check the file size before opening, in order to handle files according to their size. The OOM is caused by reading the entire file into memory and then operating on it in-memory, creating all the Descriptor objects (possibly copying the raw bytes; I didn't verify) in-memory. Memory usage could be reduced

  1. by only reading parts of the huge file and also
  2. by not adding the bytes to the descriptor objects and instead simply keeping the file path and position inside the file in-memory.

Assumptions:

  • many Descriptor objects without bytes occupy far less space than Descriptor objects currently do
  • the descriptor-containing files remain available as long as there are Descriptor objects referring to them

A sketch of changes:

  • Introduce descriptors that either hold their bytes in-memory or have a file path and in-file position(s) for accessing raw bytes, but don't store the bytes.
  • DescriptorImpl parses bytes and produces a list of the adapted Descriptor objects.
  • DescriptorReaderRunnable needs to read a certain chunk of a large file, parse enough to determine the next descriptor, and also provide the parser with the beginning and end positions in the file.

This stays very close to the current implementation; the details need some more work, and it might be necessary to change more.
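The first bullet of the sketch above could look roughly like this; all names are hypothetical, and the file-backed variant relies on the second assumption (the file stays in place while Descriptor objects refer to it):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of a descriptor that either holds its bytes in memory or
 *  stores only a file path plus in-file position, fetching the raw
 *  bytes from disk on demand. */
public class LazyDescriptor {
  private final byte[] rawBytes;   // non-null only for the in-memory variant
  private final Path sourceFile;   // non-null only for the file-backed variant
  private final long offset;
  private final int length;

  /** In-memory variant: bytes are kept as before. */
  public LazyDescriptor(byte[] rawBytes) {
    this.rawBytes = rawBytes;
    this.sourceFile = null;
    this.offset = 0L;
    this.length = rawBytes.length;
  }

  /** File-backed variant: only path and in-file position are stored. */
  public LazyDescriptor(Path sourceFile, long offset, int length) {
    this.rawBytes = null;
    this.sourceFile = sourceFile;
    this.offset = offset;
    this.length = length;
  }

  /** Provides raw bytes, touching the file system only on request.
   *  Relies on the file neither being deleted nor moved meanwhile. */
  public byte[] getRawDescriptorBytes() throws IOException {
    if (rawBytes != null) {
      return rawBytes;
    }
    try (SeekableByteChannel channel = Files.newByteChannel(sourceFile)) {
      channel.position(offset);
      ByteBuffer buffer = ByteBuffer.allocate(length);
      while (buffer.hasRemaining() && channel.read(buffer) > 0) {
        // keep reading until the requested range is filled
      }
      return buffer.array();
    }
  }
}
```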

comment:8 in reply to:  7 Changed 19 months ago by karsten

Great ideas above! And I think we should implement them, because they clearly improve memory consumption.

Going into more details, your second assumption makes sense to me. I didn't think of that before, but I agree that we can make that assumption.

However, your first assumption is unfortunately wrong. I just concatenated all votes from May 1, 2017 into a single file with a size of 0.8G. I passed that to metrics-lib to read and parse, which consumed 4.1G of memory in total for parsed descriptors and the contained raw descriptor bytes. I then modified DescriptorImpl to avoid storing raw descriptor bytes in memory, which reduced memory consumption to 3.3G. The difference is precisely the 0.8G of the original file. But the 3.3G still remain, and that number will grow with the number of descriptors we put in a file. For example, 72 hours of votes from an initial CollecTor sync with all votes concatenated into a single file would consume 9.9G, plus a few G more while parsing. So the suggestion above is certainly an improvement, but it still does not scale.

But I could see us making these suggested improvements anyway, and they'll help us going forward. Some thoughts:

  • We could modify Descriptor#getRawDescriptorBytes() to use its file reference and start and end position to retrieve the bytes from disk and return them to the caller. That is, rather than bothering the user to do that. This would even make this change backward-compatible.
  • We should avoid calling that new Descriptor#getRawDescriptorBytes() ourselves at all costs while parsing and instead pass the bytes around directly. I'm mentioning this explicitly, because I found uses of that method where we could have passed around these bytes as parameters instead.
  • We need to be careful to write the reading-files-in-chunks logic in a way that detects descriptor starts and ends across chunk boundaries. Think of tiny descriptors like microdescriptors.
  • And we should avoid scanning chunks repeatedly when a descriptor covers many, many such chunks. Think of huge descriptors like votes.
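The chunk-boundary point in the list above can be sketched by carrying the unconsumed tail of each chunk over into the next read, so that a descriptor start straddling a boundary is still found. The "router " start marker here is only a stand-in; real descriptor keywords differ per type:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split a stream of chunks into descriptors whose start
 *  marker may straddle a chunk boundary. */
public class BoundarySplitter {
  private static final String MARKER = "router ";  // stand-in keyword
  private final StringBuilder carry = new StringBuilder();
  private final List<String> complete = new ArrayList<>();

  /** Feed one chunk; any descriptor completed by it is emitted. */
  void feed(String chunk) {
    carry.append(chunk);
    int next;
    // A descriptor ends where the next one begins; the last, possibly
    // incomplete, descriptor stays in the carry buffer.
    while ((next = carry.indexOf(MARKER, MARKER.length())) >= 0) {
      complete.add(carry.substring(0, next));
      carry.delete(0, next);
    }
  }

  /** Flushes the final descriptor once the file is fully read. */
  List<String> finish() {
    if (carry.length() > 0) {
      complete.add(carry.toString());
      carry.setLength(0);
    }
    return complete;
  }
}
```

To also avoid rescanning huge carried-over descriptors (the second bullet), the search could additionally remember the position where the last scan left off instead of restarting at the marker length each time.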

Once we're there, let's talk more about avoiding keeping potentially huge lists of parsed descriptors in memory.

Do you want to start hacking on your suggestions above?

comment:9 Changed 19 months ago by iwakeh

Owner: changed from karsten to iwakeh
Status: new → accepted

I'll look into this further.

comment:10 Changed 19 months ago by iwakeh

Milestone: metrics-lib 2.0.0 → metrics-lib 1.9.0

comment:11 Changed 19 months ago by karsten

Milestone: metrics-lib 1.9.0 (removed)

This is a cool improvement, but it currently looks like we won't get to this before 2.1.0. Unassigning from milestone.

comment:12 Changed 15 months ago by karsten

Summary: metrics-lib should be able to handle large descriptor files → Add capability to handle large descriptor files

Tweak summary.

comment:13 Changed 11 months ago by iwakeh

Owner: changed from iwakeh to metrics-team
Status: accepted → assigned

Moving to metrics-team as I won't be working on this during the next week.

comment:14 Changed 10 months ago by karsten

Owner: changed from metrics-team to karsten
Status: assigned → accepted

comment:15 Changed 10 months ago by karsten

Status: accepted → needs_review

I started making some improvements here. Here's my train of thought:

  1. Rather than reading the whole file to memory at the beginning, we could read it in chunks and start parsing as soon as we have seen a full descriptor. This sounds like a useful improvement, but it's actually very limited, at least on its own. Reading the 70M descriptor file I used for testing is actually done really fast. It's the parsing that takes long. As long as we need the full descriptor file contents in memory, we don't have to think about reading files in chunks. (See also 3. below.)
  2. Rather than parsing all descriptors contained in a given file into a list and then taking all parsed descriptors and throwing them into the BlockingIterator<Descriptor>, we could just skip the list in the middle. The effect is that the time to first descriptor is reduced by a huge amount of time, whereas the time to last descriptor stays the same. I prepared a patch for this. The commit message contains more details.
  3. Rather than storing descriptor file contents in a byte[], we could go through the file, read descriptor by descriptor, and store a File reference together with offset and length into the file. The effect would be that we avoid keeping the raw descriptor file contents in memory at all. We'd still keep parsed contents in memory. A possible downside is that the file must not be deleted or moved away while the application processes descriptors, which should be safe to require. Still, this is a larger change than 2., and it requires 1. That's why I postponed this.
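Point 2 above can be sketched as a producer/consumer hand-off, where the parser puts each descriptor into a bounded queue as soon as it is parsed and a sentinel marks the end of the stream. Strings stand in for parsed Descriptor objects; none of these names are metrics-lib API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sketch: hand each descriptor to the consumer as soon as it is
 *  parsed, instead of first collecting a complete List. */
public class StreamingHandoff {
  static final String END = "\u0000END";  // sentinel, never a real descriptor

  /** "Parses" raw inputs on a background thread and returns the
   *  consumed descriptors in arrival order. */
  static List<String> parseAndConsume(String[] rawInputs) throws InterruptedException {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
    Thread parser = new Thread(() -> {
      try {
        for (String raw : rawInputs) {
          queue.put("parsed:" + raw);  // visible to the consumer immediately
        }
        queue.put(END);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    parser.start();
    List<String> consumed = new ArrayList<>();
    String descriptor;
    while (!(descriptor = queue.take()).equals(END)) {
      consumed.add(descriptor);
    }
    parser.join();
    return consumed;
  }
}
```

The time to first descriptor drops because the consumer no longer waits for the whole file to be parsed, while the time to last descriptor stays the same, matching the effect described in point 2.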

Please review commit ef9406c in my task-20395 branch.

comment:16 Changed 10 months ago by iwakeh

Reviewer: iwakeh

Grabbing this review.

comment:17 Changed 10 months ago by iwakeh

Status: needs_review → needs_revision

Thinking for a while and looking at the proposed branch I'd say it's finally time for an API change.

The proposed changes pretty much result in a blocking queue that knows when no more items will be added.

Therefore, it might be very useful for clients using metrics-lib to have BlockingQueue<Descriptor> descriptorQueue in DescriptorReader, where they can choose if they want to call a blocking method or a timeout method. The new BlockingQueue could be based on ArrayBlockingQueue with the additional methods boolean outOfDescriptors() and boolean setOutOfDescriptors().

The queue would also allow more than one consumer to access the descriptors as they become available, instead of iterating over or waiting for a list of descriptors. Adding elements in parallel would also become easier.

We could provide the new method BlockingQueue<Descriptor> descriptorQueue while deprecating the old Iterable<Descriptor> readDescriptors before removing it, thus paving the way to faster, more efficient processing on both the client's and metrics-lib's side.
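A possible shape for the proposed queue, using the method names from the comment above (with a void setter for simplicity); this is a sketch of the suggestion, not existing metrics-lib API:

```java
import java.util.concurrent.ArrayBlockingQueue;

/** Sketch: an ArrayBlockingQueue that also knows when no more
 *  descriptors will ever be added. */
public class DescriptorQueue<T> extends ArrayBlockingQueue<T> {
  private volatile boolean outOfDescriptors = false;

  public DescriptorQueue(int capacity) {
    super(capacity);
  }

  /** Producer calls this once after adding the last descriptor. */
  public void setOutOfDescriptors() {
    this.outOfDescriptors = true;
  }

  /** True once the producer is done AND the queue is drained. */
  public boolean outOfDescriptors() {
    return outOfDescriptors && isEmpty();
  }
}
```

A consumer could then loop with poll plus a timeout, checking outOfDescriptors() between attempts, rather than blocking forever on an iterator that may never yield another element.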

comment:18 Changed 10 months ago by iwakeh

Consider also #24166, which would be easily achievable with the suggestion in comment:17.

comment:19 Changed 10 months ago by karsten

Hmm. I see the benefits of such an API change. And we already have #22678 that describes something very close, if not the same.

But should we separate the API change from this fix? In a way, I consider it a bug that we're parsing an entire descriptor file before passing all descriptors to the application. It's something we should have changed when taking out DescriptorFile.

How about I rewrite my patch to not make any changes to the API (in particular DescriptorParser), and then we discuss the API change, including how to do it in a backward-compatible fashion, as part of #22678?

comment:20 in reply to:  19 Changed 10 months ago by iwakeh

Replying to karsten:

Hmm. I see the benefits of such an API change. And we already have #22678 that describes something very close, if not the same.

Yes!

But should we separate the API change from this fix? In a way, I consider it a bug that we're parsing an entire descriptor file before passing all descriptors to the application. It's something we should have changed when taking out DescriptorFile.

How about I rewrite my patch to not make any changes to the API (in particular DescriptorParser), and then we discuss the API change, including how to do it in a backward-compatible fashion, as part of #22678?

I think all effort is better spent renewing the API. You might call the issue at hand a 'bug', but it really is not. The issues we're dealing with are historically grown design decisions that should be modernized now, not patched and stitched to last longer. There is no way around the API change, so the sooner the better.

comment:21 Changed 10 months ago by karsten

I think the two issues are unrelated. One is about adding parsed descriptors to a data structure, the other is about using different code for the data structure.

Let's just try to fix this in the next couple of days. If that includes a backward-compatible API change, that's okay. Want to grab this?

comment:22 Changed 10 months ago by iwakeh

Owner: changed from karsten to iwakeh
Status: needs_revision → accepted

I'll give it a try.

comment:23 Changed 9 months ago by iwakeh

Cc: metrics-team added
Reviewer: iwakeh

This is intertwined with #22678 and might resolve #21365 as obsolete.

comment:24 Changed 7 months ago by iwakeh

Owner: changed from iwakeh to metrics-team
Status: acceptedassigned

Won't be starting work on this next week; re-assigning to metrics-team.
