Opened 6 years ago

Closed 5 years ago

#11788 closed enhancement (fixed)

Consider providing descriptor tarballs as .tar.xz rather than .tar.bz2

Reported by: karsten
Owned by:
Priority: Medium
Milestone:
Component: Metrics/CollecTor
Version:
Severity:
Keywords:
Cc: nickm, wfn, Yawning
Actual Points:
Parent ID:
Points:
Reviewer:
Sponsor:

Description

nickm notes that xz -9 compresses descriptor tarballs a lot better than bzip2.

Sample 1: file sizes in kB for May consensuses:

22620 consensuses-bzip2.bz2
 2532 consensuses-xz.xz
 1948 consensuses-xz9.xz

(Will add another sample once yatei is done compressing April votes.)

Switching is as easy as editing the shell script that is run every 3 days on yatei. Recompressing existing tarballs is also just a shell command away.
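
As a rough illustration of the recompression step, here is a minimal Python sketch that streams a .tar.bz2 file into a .tar.xz file without writing the intermediate tarball to disk. The helper name recompress_bz2_to_xz and its parameters are hypothetical; the actual job on yatei is a shell script that is not shown in this ticket. preset=9 is meant to mirror xz -9.

```python
import bz2
import lzma
import shutil


def recompress_bz2_to_xz(src_path, dst_path, preset=9, chunk_size=1 << 20):
    """Stream-decompress src_path (bzip2) and recompress it to dst_path (xz).

    Hypothetical helper for illustration only. preset=9 corresponds to
    `xz -9`; lower presets need far less memory while compressing.
    """
    with bz2.open(src_path, "rb") as src, \
            lzma.open(dst_path, "wb", format=lzma.FORMAT_XZ, preset=preset) as dst:
        # Copy in fixed-size chunks so the whole tarball never sits in memory.
        shutil.copyfileobj(src, dst, chunk_size)
```

The source file is left in place, so the .tar.bz2 and .tar.xz versions can coexist for a while.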

Are there drawbacks to consider? Maybe:

  • Compression will take longer; right now, at the end of a month, yatei spends about 1 hour running bzip2 on the various tarballs. That might become 2 or 3 hours with xz.
  • People won't find tarballs under the usual URL, because their file extensions will change. (https://metrics.torproject.org/data.html is going to list the correct URLs though.)
  • Anything else?

Change History (9)

comment:1 Changed 6 years ago by wfn

A purely procedural/logistical thing: I wonder how many services/tools use the Metrics archives, and whether it makes sense to convert all existing/previous .tar.bz2 archives to .tar.xz. Of course as Karsten says, if the latter is not done, "the downside is that new tools will have to support both .tar.bz2 and .tar.xz if we don't recompress existing archives."

In any case, quietly changing to .tar.xz is maybe not the way to go, in the sense that this should at the very least be announced. How many existing tools/software may rely on these Metrics archives?
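
For tools written in Python, supporting both formats may be cheap. The sketch below assumes a tool only needs to read the tarball members; the function name list_descriptor_files is made up for this example. It relies on tarfile's "r:*" mode, which detects gzip, bzip2 and xz compression from the file contents rather than from the extension.

```python
import tarfile


def list_descriptor_files(archive_path):
    """Return the names of regular files in a compressed tarball.

    Mode "r:*" lets tarfile detect the compression (gzip, bzip2 or xz)
    from the file contents, so the file extension does not matter.
    """
    with tarfile.open(archive_path, mode="r:*") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]
```

Tools that shell out to bunzip2 or hard-code the .tar.bz2 URL would still need changes, which is why an announcement matters.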

> Anything else?

Memory usage?[1][2] Though as Nick said, xz memory usage seems to be constant / invariant to target size (depends on compression level only), I guess because the compression level chooses the dictionary size; and that is what uses the memory.

[1]: http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
[2]: http://linux.die.net/man/1/xz

comment:2 in reply to:  1 ; Changed 6 years ago by karsten

Replying to wfn:

> A purely procedural/logistical thing: I wonder how many services/tools use the Metrics archives, and whether it makes sense to convert all existing/previous .tar.bz2 archives to .tar.xz. Of course as Karsten says, if the latter is not done, "the downside is that new tools will have to support both .tar.bz2 and .tar.xz if we don't recompress existing archives."

Right, I think at some point we'll want to provide archives using a single compression method.

> In any case, quietly changing to .tar.xz is maybe not the way to go, in the sense that this should at the very least be announced. How many existing tools/software may rely on these Metrics archives?

Agreed about not making this change quietly. Here's what we could do:

  1. Start compressing new tarballs with xz in addition to bzip2 and recompress existing tarballs using xz but without deleting the bzip2 ones. Change links on https://metrics.torproject.org/data.html to the .tar.xz tarballs. Tell people on tor-dev@ about the change, but say that .tar.bz2 tarballs will be available for another two months.
  2. Two months later, stop creating .tar.bz2 tarballs and delete existing .tar.bz2 tarballs.
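
Before step 2, it may be worth checking that every .tar.bz2 file really has a .tar.xz counterpart. The following is only a sketch; the helper name bz2_without_xz and the assumption that all tarballs sit in one directory are mine, not taken from the CollecTor setup.

```python
from pathlib import Path


def bz2_without_xz(directory):
    """Return .tar.bz2 files in `directory` that have no .tar.xz sibling yet.

    Hypothetical helper: an empty result suggests that deleting the
    .tar.bz2 files would not remove the only copy of any tarball.
    """
    missing = []
    for bz2_path in sorted(Path(directory).glob("*.tar.bz2")):
        # "votes-2014-04.tar.bz2" -> "votes-2014-04.tar.xz"
        xz_path = bz2_path.with_suffix(".xz")
        if not xz_path.exists():
            missing.append(bz2_path.name)
    return missing
```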

> > Anything else?
>
> Memory usage?[1][2] Though as Nick said, xz memory usage seems to be constant / invariant to target size (depends on compression level only), I guess because the compression level chooses the dictionary size; and that is what uses the memory.
>
> [1]: http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
> [2]: http://linux.die.net/man/1/xz

Right. The first link you mention there says we'll need up to 673MB for compressing and up to 64MB for decompressing a tarball using xz. Sounds reasonable.
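
The point that memory depends on the compression level rather than on the input size can be tried out with Python's lzma module, which accepts a memory limit for decompression. This is a sketch under assumptions: presets 0 and 6 stand in for the ticket's -9 to keep memory use modest, and the 4 MiB limit is an arbitrary value chosen to fall between the requirements of the two presets. fits_in_memory is a made-up helper name.

```python
import lzma


def fits_in_memory(xz_data, memlimit_bytes):
    """Return True if xz_data can be decompressed within memlimit_bytes."""
    decompressor = lzma.LZMADecompressor(memlimit=memlimit_bytes)
    try:
        decompressor.decompress(xz_data)
    except lzma.LZMAError:
        # Also raised for corrupt input; here the input is known to be valid.
        return False
    return True


payload = b"the same small input for both presets\n" * 100
low = lzma.compress(payload, preset=0)    # small dictionary
high = lzma.compress(payload, preset=6)   # larger dictionary

limit = 4 * 1024 * 1024  # 4 MiB, arbitrary limit for this illustration
print(fits_in_memory(low, limit), fits_in_memory(high, limit))
```

The input is identical and tiny in both cases, so any difference in the result comes from the preset that was used to compress it.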

comment:3 in reply to:  2 Changed 6 years ago by wfn

Replying to karsten:

> [...]
> Agreed about not making this change quietly. Here's what we could do:
>
>   1. Start compressing new tarballs with xz in addition to bzip2 and recompress existing tarballs using xz but without deleting the bzip2 ones. Change links on https://metrics.torproject.org/data.html to the .tar.xz tarballs. Tell people on tor-dev@ about the change, but say that .tar.bz2 tarballs will be available for another two months.
>   2. Two months later, stop creating .tar.bz2 tarballs and delete existing .tar.bz2 tarballs.

FWIW, this plan sounds good to me. :)

comment:4 Changed 6 years ago by karsten

Sample 2:

$ ls -lh votes-2014-04.tar.bz2 
-rw-r--r-- 1 metrics metrics 4.9G May  7 06:15 votes-2014-04.tar.bz2
$ bunzip2 votes-2014-04.tar.bz2
$ ls -lh votes-2014-04.tar
-rw-r--r-- 1 metrics metrics  13G May  7 14:14 votes-2014-04.tar
$ time xz -9 votes-2014-04.tar
real    123m8.199s
user    117m30.129s
sys     0m21.541s
$ ls -lh votes-2014-04.tar.xz
-rw-r--r-- 1 metrics metrics 172M May  7 14:14 votes-2014-04.tar.xz

That's an impressive reduction by a factor of 29. I had no idea!

What will be funny is when people decompress a few votes tarballs (or even all of them) to their hard disk and find that they occupy 77 times as much disk space as in compressed form. Guess we should add a warning to data.html.
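
For reference, the two factors follow from the sizes in the listing above. The sketch assumes that ls -lh reports sizes in powers of 1024; parse_size is a helper made up for this example, and the inputs are the rounded figures that ls printed, so the results are approximate.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert an `ls -lh` style size such as "4.9G" to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


bz2_size = parse_size("4.9G")   # votes-2014-04.tar.bz2
tar_size = parse_size("13G")    # votes-2014-04.tar
xz_size = parse_size("172M")    # votes-2014-04.tar.xz

print(round(bz2_size / xz_size))  # 29: bzip2 output relative to xz output
print(round(tar_size / xz_size))  # 77: uncompressed tarball relative to xz output
```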

comment:5 Changed 6 years ago by karsten

Resolution: implemented
Status: new → closed

Recompressed the archive, updated the website, informed tor-dev@. Closing.

comment:6 Changed 6 years ago by karsten

arma suspects that bzip2 -9 might already have reduced tarball size a lot. Tested with two tarballs:

tarball              bzip2 -1  bzip2 -9  xz -9
consensuses-2014-04  241M      217M      21M
votes-2014-04        4.8G      4.4G      172M

So, it seems that switching from bzip2 -1 to bzip2 -9 would not have made that much of a change compared to xz -9.
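
One plausible explanation, which is my guess and not something measured in this ticket: bzip2 compresses independent blocks of at most 900 kB (100 kB with -1), while xz presets use much larger dictionaries, so xz can exploit repetition across many similar documents. The sketch below tries this on synthetic data, not on real descriptors; compressed_sizes is a made-up helper, and the xz preset defaults to 6 instead of 9 to keep memory use modest.

```python
import bz2
import lzma
import random


def compressed_sizes(data, xz_preset=6):
    """Compress data three ways and return the output sizes in bytes."""
    return {
        "bzip2 -1": len(bz2.compress(data, compresslevel=1)),
        "bzip2 -9": len(bz2.compress(data, compresslevel=9)),
        "xz": len(lzma.compress(data, preset=xz_preset)),
    }


# Synthetic stand-in for a series of similar documents: one 256 KiB
# pseudo-random block repeated eight times (2 MiB in total).
block = random.Random(0).getrandbits(8 * 256 * 1024).to_bytes(256 * 1024, "little")
sizes = compressed_sizes(block * 8)
for name, size in sizes.items():
    print(f"{name:9} {size:>9} bytes")
```

Real descriptors are text and compress far better than this input under all three settings; the sketch only illustrates the effect of repetition at distances larger than a bzip2 block.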

comment:7 Changed 6 years ago by arma

Gosh. Thanks!

comment:8 Changed 5 years ago by karsten

Cc: Yawning added
Resolution: implemented
Status: closed → reopened

Yawning suggests xz -9e as an alternative to xz -9. Here are some results:

Command  Decompressed size  Compression time  Compressed size  Decompression time
xz -9    13G                56m54.372s        172M (1.29%)     0m33.692s
xz -9e   13G                105m23.670s       115M (0.86%)     0m30.137s

That's 67% of the current size for an additional 85% of compression time, but with roughly 10% faster decompression. Worth considering after switching CollecTor to new hardware, I'd say. Re-opening.
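
The percentages can be rechecked from the table with a few lines of Python. parse_duration and relative_change are helper names made up for this sketch, and the sizes are the rounded values from the table. For what it's worth, the closest equivalent of xz -9e in Python's lzma module should be preset=9 | lzma.PRESET_EXTREME.

```python
import re


def parse_duration(text):
    """Convert a `time`-style duration such as "56m54.372s" to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", text).groups()
    return int(minutes) * 60 + float(seconds)


def relative_change(new, old):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


compress = relative_change(parse_duration("105m23.670s"), parse_duration("56m54.372s"))
decompress = relative_change(parse_duration("0m30.137s"), parse_duration("0m33.692s"))
size = relative_change(115, 172)  # compressed sizes in MB, from the table

print(f"compression time:   {compress:+.1f}%")
print(f"decompression time: {decompress:+.1f}%")
print(f"compressed size:    {size:+.1f}%")
```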

comment:9 Changed 5 years ago by karsten

Resolution: fixed
Status: reopened → closed

Actually, I just finished recompressing tarballs. Old tarballs were 49.03G, new tarballs are 44.09G, so down to 90% of the original size. Resolving.
