It's only 2% larger, but I thought zstd was meant to produce smaller consensuses than gzip? Or did I get the compression settings wrong?
Trac: Summary: A zstd-compressed cached-microdesc-consensus is 1.5% larger than a gzipped one to A zstd-compressed cached-microdesc-consensus is 2% larger than a gzipped one
Assuming that the calculations in our files are approximately right, this gives us the following memory usage for compression:
||= compression_level_t =||= zlib KB usage (approx) =||= zstd KB usage (approx) =||
|| BEST || 386 || 10880 ||
|| HIGH || 258 || 10880 ||
|| MEDIUM || 98 || 9856 ||
|| LOW || 42 || 8832 ||
and the following compressed output sizes (measured in a hacked Tor):
||= compression_level_t =||= zlib consensus size (bytes) =||= zstd consensus size (bytes) =||
|| BEST || 525841 || 492916 ||
|| HIGH || 526470 || 492916 ||
|| MEDIUM || 578218 || 495020 ||
|| LOW || 663334 || 496860 ||
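For anybody who wants to reproduce the size comparison without a hacked Tor, a rough standalone sketch like the one below should give numbers in the same ballpark: it compresses a cached-microdesc-consensus file once with zlib's compress2() and once with ZSTD_compress() at whatever levels you pass on the command line. Note that this is one-shot, zlib-format compression rather than Tor's actual streaming gzip path, so the byte counts will differ slightly; the file path and levels are whatever you supply.
{{{
/* Rough sketch: compare zlib vs. zstd output size for one input file.
 * Not Tor code; one-shot compression only. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>
#include <zstd.h>

int main(int argc, char **argv)
{
  if (argc != 4) {
    fprintf(stderr, "usage: %s <consensus-file> <zlib-level> <zstd-level>\n",
            argv[0]);
    return 1;
  }
  FILE *f = fopen(argv[1], "rb");
  if (!f)
    return 1;
  fseek(f, 0, SEEK_END);
  long len = ftell(f);
  fseek(f, 0, SEEK_SET);
  char *buf = malloc((size_t)len);
  if (!buf || fread(buf, 1, (size_t)len, f) != (size_t)len)
    return 1;
  fclose(f);

  /* zlib, one-shot (zlib format, not gzip framing). */
  uLongf zlen = compressBound((uLong)len);
  Bytef *zout = malloc(zlen);
  if (!zout ||
      compress2(zout, &zlen, (const Bytef *)buf, (uLong)len,
                atoi(argv[2])) != Z_OK)
    return 1;

  /* zstd, one-shot. */
  size_t scap = ZSTD_compressBound((size_t)len);
  void *sout = malloc(scap);
  size_t slen = ZSTD_compress(sout, scap, buf, (size_t)len, atoi(argv[3]));
  if (!sout || ZSTD_isError(slen))
    return 1;

  printf("input %ld bytes, zlib %lu bytes, zstd %zu bytes\n",
         len, (unsigned long)zlen, slen);
  return 0;
}
}}}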
Hm. It looks like, if our numbers are right, zstd is far more memory-hungry than gzip is. That's fine for precompression, but for streaming usage, we should probably tune our zstd parameter choices.
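As a quick sanity check on the memory table above, zstd can report its own estimate of how much state a streaming compressor needs at a given preset level. Here's a minimal sketch, assuming a zstd new enough to ship the estimator functions (they live behind ZSTD_STATIC_LINKING_ONLY); the levels listed are arbitrary samples, not necessarily the levels our presets map to:
{{{
/* Sketch: print zstd's own streaming-memory estimate per compression level. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdio.h>

int main(void)
{
  const int levels[] = { 1, 3, 7, 9 };  /* sample levels, not Tor's mapping */
  for (size_t i = 0; i < sizeof(levels) / sizeof(levels[0]); ++i) {
    size_t est = ZSTD_estimateCStreamSize(levels[i]);
    printf("level %d: ~%zu KB of streaming compressor state\n",
           levels[i], est / 1024);
  }
  return 0;
}
}}}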
Trac: Summary: A zstd-compressed cached-microdesc-consensus is 2% larger than a gzipped one to Tune zstd parameters to decrease memory usage during streaming
I've done some initial experimentation, and here's what I found:
Adjusting the pre-set values shouldn't be necessary if instead we tell zstd to build its own parameters (using ZSTD_getCParams() or ZSTD_getParams()), with the estimatedSrcSize argument to tell zstd how big we expect the input to be.
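Roughly, the idea is something like the sketch below (not actual Tor code): ask ZSTD_getCParams() for parameters given a hint about the input size, then feed the result to the estimator to see how much memory a stream built with those parameters would need. The ~2 MB size hint and level 9 are just illustrative values; both functions are in the static-only section of zstd.h.
{{{
/* Sketch: let zstd derive size-aware parameters instead of bare presets. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdio.h>

int main(void)
{
  const unsigned long long estimated_src_size = 2 * 1024 * 1024; /* illustrative */
  const int level = 9;                                           /* illustrative */

  ZSTD_compressionParameters cparams =
    ZSTD_getCParams(level, estimated_src_size, /*dictSize=*/0);
  size_t mem = ZSTD_estimateCStreamSize_usingCParams(cparams);

  printf("level %d, src hint %llu bytes: windowLog=%u, ~%zu KB of state\n",
         level, estimated_src_size, cparams.windowLog, mem / 1024);
  return 0;
}
}}}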
I also think that our current estimates are higher than zstd actually uses, which is a good thing. I'm attaching a python script that I used for these tests; it requires the "zstandard" package.
Now the catch here is that we can't actually adjust the parameters to anything besides the presets unless we use the "advanced" (a.k.a. "static-only") zstd APIs. I've opened ticket #25162 (moved) about doing that safely. But the complexity is enough that I think we should call this an 0.3.4.x ticket: it is more than simply tweaking a couple of numbers.
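To make the constraint concrete, here's the shape of the call the advanced path would need: take the size-aware parameters from ZSTD_getParams() and hand them to ZSTD_initCStream_advanced(), both of which sit behind ZSTD_STATIC_LINKING_ONLY, which is exactly the compatibility question #25162 (moved) is about. This is only an illustration of the API surface involved, not a proposal for the final code; the helper name is made up.
{{{
/* Sketch only: build a streaming compressor tuned for inputs of roughly
 * `estimated_src_size` bytes, using the static-only zstd APIs. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stddef.h>

/* Hypothetical helper; returns NULL on failure. */
static ZSTD_CStream *
new_tuned_cstream(int level, unsigned long long estimated_src_size)
{
  ZSTD_CStream *zcs = ZSTD_createCStream();
  if (!zcs)
    return NULL;

  ZSTD_parameters params =
    ZSTD_getParams(level, estimated_src_size, /*dictSize=*/0);

  /* The size hint shapes the parameters; we don't pledge an exact source
   * size, since it is only an estimate. */
  size_t rc = ZSTD_initCStream_advanced(zcs, NULL, 0, params,
                                        ZSTD_CONTENTSIZE_UNKNOWN);
  if (ZSTD_isError(rc)) {
    ZSTD_freeCStream(zcs);
    return NULL;
  }
  return zcs;
}
}}}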
Trac: Milestone: Tor: 0.3.3.x-final to Tor: 0.3.4.x-final