We switched from providing .tar.bz2 files on CollecTor to .tar.xz around July or August 2014, but we never updated metrics-lib to parse those.
We should find out how much slower parsing .xz-compressed tarballs is compared to uncompressed tarballs. If the overhead is too high, the better plan may be to put out a big warning and suggest that users decompress the tarballs themselves using the unxz tool.
When we add support for this, let's try not to add any new dependencies to metrics-lib.
It looks like streaming reads from .tar.xz are faster than from .tar.bz2. A good sign. I'll put together a benchmark for you.
Here's a benchmark. It might be a bit verbose, but it shows that compared to .tar.bz2 (what we had previously), .tar.xz is an improvement in both file size and streaming read performance. Note that the benchmark provides some totals but makes no attempt to hide read latency. If setup and read were separated out from a dedicated parsing thread, it's conceivable that the read could be hidden behind the parse.
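In rough outline, the benchmark times something like the following (a simplified sketch, not the attachment itself; the class name, file names, and 8k buffer size here are illustrative):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;

public class TarballReadTimer {

  public static void main(String[] args) throws Exception {
    /* Compare streaming-read throughput of the two compression formats.
     * Auto-detection of .xz requires xz.jar on the classpath. */
    for (String path : new String[] { "2015-07.tar.xz", "2015-07.tar.bz2" }) {
      long start = System.nanoTime();
      long bytes = streamEntireTarball(path);
      System.out.printf("%s: %d bytes in %.2f s%n", path, bytes,
          (System.nanoTime() - start) / 1e9);
    }
  }

  /* Walk every entry of the tarball and pull all of its bytes through
   * the decompressor, returning the total number of bytes read. */
  private static long streamEntireTarball(String path) throws Exception {
    try (InputStream fileIn =
            new BufferedInputStream(new FileInputStream(path), 8192);
        InputStream decompressed = new CompressorStreamFactory()
            .createCompressorInputStream(fileIn);
        TarArchiveInputStream tarIn =
            new TarArchiveInputStream(decompressed)) {
      long total = 0L;
      byte[] buffer = new byte[8192];
      TarArchiveEntry entry;
      while ((entry = tarIn.getNextTarEntry()) != null) {
        if (!entry.isFile()) {
          continue;
        }
        int read;
        while ((read = tarIn.read(buffer)) != -1) {
          total += read;
        }
      }
      return total;
    }
  }
}
```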
I'll look at instrumenting metrics-lib next (on a private copy). I see a potential improvement that may be useful while adding the .xz support: it looks like the read from the archive could be done more efficiently.
I tried to fix a typo in the file description, which is why you see two attachments. They're the same file.
Which other improvements? The parsing improvement or the read from the archive? If the parsing improvement, that's a surprise. As far as the read from archives is concerned, I'll paraphrase in pseudo-pseudocode what metrics-lib does versus what I do in the benchmark.
Metrics-lib:

1. (8k reads) get BufferedInputStream(get TarArchiveInputStream(get FileInputStream from archive))
2. (set position) get a tar entry corresponding to a contained descriptor file
3. construct an unbounded ByteArrayOutputStream of unknown size, requiring JVM growth management
4. read 1k from the BufferedInputStream into a separate 1k buffer
5. copy this 1k buffer into the ByteArrayOutputStream
6. repeat (4) and (5) until all bytes of the tar entry are read
7. now the ByteArrayOutputStream holds a copy of the tar entry; convert it to a byte array (making another copy)
8. parse the second copy
Benchmark:

1. (8k reads) get ArchiveInputStream(get BufferedInputStream(get FileInputStream from archive))
2. (set position) get an archive entry corresponding to a contained descriptor file
3. construct a bounded byte array from the known size of the archive entry, a primitive that the JVM only needs to garbage-collect
4. read the archive entry into the byte array
5. parse the byte array
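In code, that read path amounts to roughly the following (simplified from the benchmark; the method name and error handling are illustrative):

```java
import java.io.EOFException;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

/* Allocate one byte array from the entry size recorded in the tar
 * header and fill it in a single pass, with no intermediate
 * ByteArrayOutputStream and no second copy. */
private static byte[] readEntry(TarArchiveInputStream tarIn,
    TarArchiveEntry entry) throws IOException {
  byte[] contents = new byte[(int) entry.getSize()];
  int offset = 0;
  while (offset < contents.length) {
    int read = tarIn.read(contents, offset, contents.length - offset);
    if (read < 0) {
      throw new EOFException("Truncated tar entry: " + entry.getName());
    }
    offset += read;
  }
  return contents; /* hand this array straight to the parser */
}
```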
Maybe it doesn't matter much, which is why I'll next be doing some testing of metrics-lib performance. I'll post some results once I've had a chance to run the tests on metrics-lib. Mostly the read is only noticeable for large archive entries. If you want to try it out, put the partial 2015-07.tar.xz archives in the same folder as the benchmark, make sure you have metrics-lib plus its dependencies (libcommons-codec, libcommons-compress), and then compile and run, making sure to set your classpath.
The patch reimplements the compressed tarball support and makes some attempts at improving buffering. It also requires a small change to the build file for metrics-lib.
Fixed the exception being set when a parse error happens.
I reviewed your 0001-Support-parsing-of-xz-compressed-tarballs.patch, pushed it to my public Git repository, and added some commits. Please take a look at these changes and let me know if you like them or not:
Sorry for being picky about whitespace and code style. I'm not at all saying that the styles used in metrics-lib are perfect. We can discuss code style changes if you feel strongly about something. I just think that being consistent is important to make the code more readable.
Can you create a metrics-lib (and Onionoo) repository somewhere (GitHub, Bitbucket, etc.) and post future patches as branches there? That would make it much easier to split changes into several commits.
I'm still unclear on whether we should raise the Java version from 6 to 7 for this change. The effect is that all applications using metrics-lib would suddenly have to upgrade, too. Can we postpone this upgrade? (We'd have to undo the try-with-resources part.)
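For reference, undoing that part would mean going back to the Java 6 pattern, roughly (sketch only):

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

/* Java 6 equivalent of the patch's try-with-resources: close the
 * stream in a finally block instead. */
private static void readTarball(InputStream in) throws IOException {
  TarArchiveInputStream tarIn = new TarArchiveInputStream(in);
  try {
    /* ... read entries as before ... */
  } finally {
    tarIn.close();
  }
}
```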
That change to the Ant build file where you include dependent jar files in the produced jar file seems unrelated to adding parsing support for .xz files. It might be a good idea to do this, but can we postpone (as in: undo) this change, too?
I didn't test this branch yet. It's quite possible that I broke it.
Thank you for the feedback. Please do be picky about whitespace and style.
Yes, I will do as you suggest about the repository. The only reason I haven't done so sooner is that I've been using trac as a sort of backup.
Java 7 is additive with respect to the API; any 1.6-based applications will compile fine. But yes, I can remove it. metrics-lib is exactly the kind of application that benefits from 1.7...
Which file are you referring to? xz.jar (libxz)? It is a required include; it's just normally pulled in by libcommons-compress. This include makes the dependency explicit: xz support is an optional download for libcommons-compress, and this include detects its absence.
Is it the other two includes, commons-codec.jar and commons-compress.jar? They are also required for metrics-lib functionality. The only reason it hasn't failed sooner is that you only use metrics-lib in Onionoo (and Onionoo does the include for you). For metrics-lib to be general, these includes are needed...
> Thank you for the feedback. Please do be picky about whitespace and style.

Thanks for understanding!

> Yes, I will do as you suggest about the repository. The only reason I haven't done so sooner is that I've been using trac as a sort of backup.

Oh. :)

> Java 7 is additive with respect to the API; any 1.6-based applications will compile fine. But yes, I can remove it. metrics-lib is exactly the kind of application that benefits from 1.7...

I could imagine switching to 1.7 soon; I'd just prefer merging this patch even sooner. New ticket?

> Which file are you referring to? xz.jar (libxz)? It is a required include; it's just normally pulled in by libcommons-compress. This include makes the dependency explicit: xz support is an optional download for libcommons-compress, and this include detects its absence.

Yes, the include is fine. I mean the <zipgroupfileset/> element.

> Is it the other two includes, commons-codec.jar and commons-compress.jar? They are also required for metrics-lib functionality. The only reason it hasn't failed sooner is that you only use metrics-lib in Onionoo (and Onionoo does the include for you). For metrics-lib to be general, these includes are needed...

Well, it would work if Onionoo (and others) included xz.jar themselves, even without this change to the build file, right? This is also something we could change soon, but it shouldn't be tied to adding .xz parsing support. For example, one thing I'm concerned about is that we're using different commons-codec.jar versions for building descriptor.jar and as a dependency in Onionoo. It could well be that this concern is not valid, but I'd want to think more about that and any other implications when I have more time. New ticket?
I can't really comment on using the distribution's libcommons-codec rather than pinning a particular version. Would it help that most operators of metrics-lib will be running newer versions than the one I develop with (which is from oldstable)? In general, most operators will face this dilemma, as different distributions ship different versions. Personally, I think it's not a problem as long as the code works with an oldstable version of libcommons (it does).
<zipgroupfileset/> is the way to import the contents of other jars into the produced jar; it doesn't work any other way. I'm concerned that you would prefer to import this into Onionoo instead of metrics-lib, where the feature is implemented.
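For illustration, the build-file change amounts to something like this inside the jar target (paths and jar names here are examples, not the exact patch):

```xml
<jar destfile="descriptor.jar">
  <fileset dir="classes"/>
  <!-- Unpack the dependent jars into descriptor.jar at build time. -->
  <zipgroupfileset dir="/usr/share/java">
    <include name="commons-codec.jar"/>
    <include name="commons-compress.jar"/>
    <include name="xz.jar"/>
  </zipgroupfileset>
</jar>
```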
If Onionoo's build file were updated as in #16371 (moved), then Onionoo wouldn't be using a different version.
Oh, I forgot one important detail about the xz archive support: it requires its own core for optimal performance. The reads tend to finish before parsing completes, though. Parsing plain files, instead of xz archives, tends to be single-CPU bound, because along the way metrics-lib blocks waiting for parsing to complete. Is this intended?
Regarding importing dependent jars into descriptor.jar, I can see pros and cons. Right now, we're producing a minimal descriptor.jar file and requiring applications to also add certain versions of other libraries into their classpath. With your change we'd produce a larger descriptor.jar that comes with the jars it needs, at the risk of producing uncertain behavior if applications use older or newer versions of those libraries in their classpaths. However, we don't have to solve this in this ticket, which is why I'd rather want to move this discussion to its own ticket. Sound okay?
With respect to your observation of blocking until parsing is complete, please take a look at DescriptorReader.setMaxDescriptorFilesInQueue and try setting it to a larger value. It's actually supposed to be a feature not to load and parse too many descriptors if the application doesn't process them. It's quite possible that there's a smarter way to achieve this. That discussion would go into a new ticket, too.
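For example, something like this (a sketch against the metrics-lib reader API; the value 100 and the tarball path are arbitrary):

```java
import java.io.File;
import java.util.Iterator;

import org.torproject.descriptor.DescriptorFile;
import org.torproject.descriptor.DescriptorReader;
import org.torproject.descriptor.DescriptorSourceFactory;

/* Raise the queue limit so reading blocks later; 100 is an arbitrary
 * example value, not a recommendation. */
DescriptorReader reader = DescriptorSourceFactory.createDescriptorReader();
reader.setMaxDescriptorFilesInQueue(100);
reader.addTarball(new File("2015-07.tar.xz"));
Iterator<DescriptorFile> descriptorFiles = reader.readDescriptors();
```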
Let me summarize:
- We should merge your patch without the 1.7 upgrade and without <zipgroupfileset/>. Want to undo these changes?
- We should open two or three new tickets:
  1. Upgrade to Java 1.7,
  2. Include dependent jars in descriptor.jar, and
  3. Find a smarter metric than the number of descriptor files for blocking loading and parsing until descriptors have been processed.