Opened 3 years ago

Closed 3 years ago

#21751 closed enhancement (duplicate)

Use multiple threads to parse descriptors

Reported by: karsten
Owned by: metrics-team
Priority: Medium
Milestone:
Component: Metrics/Library
Version:
Severity: Normal
Keywords: metrics-2017
Cc:
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:


The following idea came up when I looked a bit into #17831 to speed up metrics-lib.

When we read and parse descriptors from disk, we use a single thread for both steps. It's a daemon thread and not the application's main thread, so if the application's thread is busy processing parsed descriptors, we're at least using two threads. But we could parallelize further by using separate threads for reading and parsing, or even multiple threads for reading and/or parsing. I'll leave the I/O part to #17831 and focus on the multi-threaded parsing part here.

I wrote a little patch that measures the time spent reading tarball contents in DescriptorReaderImpl#readTarballs() and then extended it by moving the descriptor parsing code to a separate class that implements Runnable and gets executed by an ExecutorService. I initialized that executor with Executors.newFixedThreadPool(n) for n = [2, 4, 8, 16, 32, 64]. I also tried n = 1, but ran out of memory due to a major issue in my simple patch: it reads all tarball contents into memory when creating Task instances, even if they cannot be executed anytime soon. What we should do is block the reader thread when it realizes that the executor is already full. I'm attaching my patch, but only to avoid starting from zero the next time. It needs more work.
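One way to block the reader thread when the executor is full could look like the sketch below. This is not the attached patch: the class name BoundedParserPool, its methods, and the maxPending parameter are hypothetical, and it assumes a Semaphore is an acceptable way to cap the number of tasks (and thus tarball contents) held in memory at once.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a fixed thread pool whose submit() blocks once too many tasks are pending. */
final class BoundedParserPool {

  private final ExecutorService executor;

  /* One permit per task that may be queued or running at the same time. */
  private final Semaphore permits;

  BoundedParserPool(int threads, int maxPending) {
    this.executor = Executors.newFixedThreadPool(threads);
    this.permits = new Semaphore(maxPending);
  }

  /** Blocks the calling (reader) thread until a slot is free, then hands over the task. */
  void submit(Runnable parseTask) throws InterruptedException {
    permits.acquire();
    try {
      executor.execute(() -> {
        try {
          parseTask.run();
        } finally {
          /* Free the slot even if parsing throws. */
          permits.release();
        }
      });
    } catch (RuntimeException e) {
      /* The task was rejected and will never run, so give the permit back. */
      permits.release();
      throw e;
    }
  }

  /** Stops accepting tasks and waits for the queued ones to finish. */
  void shutdownAndWait() throws InterruptedException {
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.HOURS);
  }
}
```

With something like this, the reader would call submit() once per tarball entry and would stall as soon as maxPending entries are waiting, rather than loading everything into memory up front.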

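|| separate parser threads || read .tar file (s) || parse .tar file (s) || read .tar.xz file (s) || parse .tar.xz file (s) ||
|| none (current code) || 35 || 159 || 9 || 162 ||
|| 2 || 36 || 42 || 8 || 126 ||
|| 4 || 41 || 13 || 7 || 96 ||
|| 8 || 42 || 11 || 6 || 35 ||
|| 16 || 41 || 11 || 10 || 28 ||
|| 32 || 45 || 13 || 7 || 34 ||
|| 64 || 41 || 13 || 6 || 38 ||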

These results show that 4 threads speed up the parse time for .tar files by a factor of 12 (159 s down to 13 s), after which there's no visible improvement, and 8 threads speed up the parse time for .tar.xz files by a factor of 4.6 (162 s down to 35 s). Just from these numbers I'd suggest using 8 threads by default and making this number configurable for the application. But: needs more work.
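For reference, the factors quoted here are simply the baseline parse time divided by the parse time with n threads. The following sketch recomputes them for every row of the table; the class and method names are made up for illustration, and the numbers are copied from the measurements above.

```java
import java.util.Locale;

/** Hypothetical helper that recomputes the parse-time speedups from the table. */
final class ParseSpeedup {

  /** Speedup relative to the current single-threaded code; both arguments in seconds. */
  static double speedup(double baselineSeconds, double seconds) {
    return baselineSeconds / seconds;
  }

  public static void main(String[] args) {
    int[] threads = {2, 4, 8, 16, 32, 64};
    int[] parseTar = {42, 13, 11, 11, 13, 13};      /* parse .tar file (s) */
    int[] parseTarXz = {126, 96, 35, 28, 34, 38};   /* parse .tar.xz file (s) */
    for (int i = 0; i < threads.length; i++) {
      System.out.printf(Locale.ROOT, "%2d threads: .tar x%.1f, .tar.xz x%.1f%n",
          threads[i], speedup(159, parseTar[i]), speedup(162, parseTarXz[i]));
    }
  }
}
```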

My recommendation would be to look more into making parsing multi-threaded and save #17831 for later. It seems like parsing is the lower-hanging fruit.

Note that reading the same tarball in extracted form using the current code took 271 seconds. In that case, the lower-hanging fruit might be I/O improvements, not multi-threaded parsing. But my hope is that not many applications extract tarballs containing over 800,000 files and read them using DescriptorReader, especially not if they could just as well read the tarball directly.

Suggestions welcome! Otherwise I might pick this up again and move it forward whenever there's time.

Attachments (1)

task-21751.diff (7.2 KB) - added by karsten 3 years ago.

Change History (7)

Changed 3 years ago by karsten

Attachment: task-21751.diff added

comment:1 Changed 3 years ago by iwakeh

Milestone: metrics-lib 1.9.0

This ticket depends on #21365.

comment:2 Changed 3 years ago by karsten

Milestone: metrics-lib 1.9.0 deleted

This is an important change, but it looks like we won't get to it before 2.1.0 for which we don't have a milestone yet. Unassigning from milestones until we make new plans. (The same goes for #21365.)

comment:3 Changed 3 years ago by karsten

Keywords: metrics-2018 added

comment:4 Changed 3 years ago by karsten

Keywords: metrics-2017 added; metrics-2018 removed

comment:5 Changed 3 years ago by iwakeh

Look at Java 8 features for accomplishing this task, e.g., #23752 comments 6 and 2.
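As a rough illustration of what a Java 8 approach might look like (not necessarily what #23752 proposes), the sketch below uses CompletableFuture.supplyAsync with a fixed thread pool. The class AsyncParseSketch, the parseAll method, and the Function-based parser parameter are all hypothetical stand-ins and not part of metrics-lib.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical sketch: parse raw entries concurrently using Java 8 CompletableFuture. */
final class AsyncParseSketch {

  /** Parses every raw entry on a pool of the given size and returns results in input order. */
  static <T> List<T> parseAll(List<byte[]> rawEntries, Function<byte[], T> parser, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<CompletableFuture<T>> futures = new ArrayList<>();
      for (byte[] raw : rawEntries) {
        futures.add(CompletableFuture.supplyAsync(() -> parser.apply(raw), pool));
      }
      List<T> parsed = new ArrayList<>();
      for (CompletableFuture<T> future : futures) {
        /* join() waits for this entry and rethrows parser failures unchecked. */
        parsed.add(future.join());
      }
      return parsed;
    } finally {
      pool.shutdown();
    }
  }
}
```

Note that this version takes all raw entries as an in-memory list, so on its own it would not avoid the memory issue described in the ticket; it only shows the Java 8 API shape.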

Last edited 3 years ago by iwakeh

comment:6 Changed 3 years ago by iwakeh

Resolution: duplicate
Status: new → closed


Comment:9 of #23752 explicitly states to use the data above for comparison.
