Hacker News
by sidewndr46 345 days ago
What are you compressing with zstd? I had to do this recently and the "xz" utility still blows it away in terms of compression ratio. In terms of memory and CPU usage, zstd wins by a large margin. But in my case I only really cared about compression ratio.

People tend to care about decompression speed: xz can be quite slow when decompressing heavily compressed files, whereas zstd decompression speed is largely independent of the compression level.

People also tend to care about how much time they spend on compression for each incremental % of compression performance and zstd tends to be a Pareto frontier for that (at least for open source algorithms)

This makes sense. A lot of end-users have internet speeds that can outpace the decompression speeds of heavily compressed files. Seems like there would be an irrational psychological aspect to it as well.

Unfortunately for the hoster, they either have to eat the cost of the added bandwidth from a larger file or have people complain about slow decompression.

Well, the difference is quite a bit more manageable in practice, since you’re talking about a single-digit percentage difference in size vs. a 2-100x difference in decompression speed.
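
To put rough numbers on that trade-off, here is a minimal sketch. It assumes a simple model where the time until a file is usable is download time plus decompression time, with no overlap between the two. The function name and every figure in the usage lines are hypothetical, picked only to illustrate the shape of the trade-off; they are not measurements from this thread.

```python
def time_to_usable(compressed_bytes: int, bandwidth_bytes_per_s: float,
                   decompress_s: float) -> float:
    """Seconds until the file is usable: download, then decompress (no overlap)."""
    return compressed_bytes / bandwidth_bytes_per_s + decompress_s


# Hypothetical inputs, not taken from any benchmark.
BANDWIDTH = 100e6 / 8  # assumed 100 Mbit/s link, expressed in bytes per second

# A file that is 5% smaller but takes 10x longer to decompress.
smaller_but_slow = time_to_usable(95_000_000, BANDWIDTH, decompress_s=10.0)
larger_but_fast = time_to_usable(100_000_000, BANDWIDTH, decompress_s=1.0)

print(f"smaller file, slow decompressor: {smaller_but_slow:.1f}s")
print(f"larger file, fast decompressor:  {larger_but_fast:.1f}s")
```

Under these assumed inputs the larger file is usable sooner. On a much slower link the download term dominates and the smaller file wins instead.
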
I definitely agree, I basically have unlimited time and unlimited CPU for decompressing. Available memory is huge too. The gains from xz were significant enough that I went with it.
I usually see zstd on max settings outperform xz on speed and very slightly on compression (though that's a tiny difference).
in my experience using zstd --long --ultra -22 gives marginally better compression ratio than xz -9 while being significantly faster
I think it depends on what you're compressing. I experimented with my data, which is XML files full of hex text. xz -6 is both faster and smaller than zstd -19 by about 10%. For my data, xz -2 and zstd -17 achieve the same compressed size, but xz -2 is 3 times faster than zstd -17. I still use xz for archives because I rarely need to decompress them.
Try combining it with --long

My use cases are usually source code, SQL dumps and log files.

Sometimes xz gave marginally better results, but the difference was well below 1%.

thanks for the tips. As my data has very low entropy, both can compress down to 3-4% of original size, but xz is a lot faster in compression.

raw size: 9612344 B

zstd --ultra -22 --long=31 => 376181 B (3.91% original, 4.088s compress, 0.013s decompress)

xz -z -9 xml => 353700 B (3.68% original, 0.729s compress, 0.032s decompress)

zstd -17 --long=31 could match the compression time of xz, but the size is bigger (405602 B, 4.22% original)

If you compare only the compressed size (not to the original size), .zst would be about 6-15% larger than .xz
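
The arithmetic behind those percentages can be checked with a short sketch. The byte counts are the ones quoted above; the helper names `pct_of_raw` and `pct_larger` are made up for this example.

```python
RAW = 9_612_344  # raw size in bytes, as quoted above

# Compressed sizes in bytes, as quoted above.
sizes = {
    "zstd --ultra -22 --long=31": 376_181,
    "xz -z -9": 353_700,
    "zstd -17 --long=31": 405_602,
}


def pct_of_raw(compressed: int, raw: int = RAW) -> float:
    """Compressed size as a percentage of the raw size."""
    return 100.0 * compressed / raw


def pct_larger(a: int, b: int) -> float:
    """How much larger a is than b, in percent."""
    return 100.0 * (a / b - 1.0)


xz_size = sizes["xz -z -9"]
for name, size in sizes.items():
    print(f"{name:28s} {pct_of_raw(size):5.2f}% of raw, "
          f"{pct_larger(size, xz_size):+5.1f}% vs xz")
```

This gives roughly +6.4% for zstd -22 and +14.7% for zstd -17 relative to the .xz file, which is where the 6-15% range comes from.
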

Do you have examples where xz 'blows it away' against higher zstd levels, not just the default zstd -3?
Here are some examples of what I was doing in one case

https://www.hydrogen18.com/blog/apk-the-strangest-format.htm...

I was running "zstd --ultra --threads=0" which I assumed was asking it for the absolute maximum

I think your mistake was to use --ultra without a compression level.
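
For context, as far as the zstd man page describes it, `--ultra` only unlocks levels 20 to 22 and does not select a level by itself, so without a `-NN` flag the default level (3) still applies. A minimal sketch of building the command line with an explicit level follows. The helper `zstd_argv` and the file name `archive.tar` are hypothetical; the flags themselves (`--ultra`, `-NN`, `--threads=`, `--long=`) are the ones mentioned in this thread.

```python
import shlex
from typing import List, Optional


def zstd_argv(path: str, level: int = 3,
              long_window_log: Optional[int] = None,
              threads: int = 0) -> List[str]:
    """Build a zstd command line that always states the level explicitly."""
    if not 1 <= level <= 22:
        raise ValueError("level must be between 1 and 22")
    argv = ["zstd", f"-{level}", f"--threads={threads}"]
    if level >= 20:
        # --ultra is only needed to permit levels 20-22; it sets no level itself.
        argv.insert(1, "--ultra")
    if long_window_log is not None:
        argv.append(f"--long={long_window_log}")
    argv.append(path)
    return argv


# 'archive.tar' is a placeholder file name.
print(shlex.join(zstd_argv("archive.tar", level=22, long_window_log=31)))
# prints: zstd --ultra -22 --threads=0 --long=31 archive.tar
```

One caveat, again going by the man page: a file compressed with `--long=31` needs the same `--long=31` (or a raised `--memory` limit) when decompressing.
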

I redid your experiments with rust-wasm-1.83.0-r0.apk:

                            size       perc   c.time  d.time
    uncompressed:      290072064          -        -      -
    gzipped original:  105255109     36.29%        -      -
    bzip2 -9:          107099379     36.92%    21.1s  11.0s
    bzip3 -b511:        73539847     25.35%    28.9s  32.0s
    xz --extreme -9:    71010672     24.48%   142.0s   3.1s
    lzip -9:            70964413     24.46%   173.5s   5.3s
    zstd --ultra -22:   48288499     16.64%   155.6s   0.4s
It's pretty clear zstd blows everything else out of the water by a huge margin. And even though compressing with zstd is slightly slower than xz in this case (by less than 10%), decompression is nearly 8x as fast, and you can probably tweak the compression level to make zstd be both faster and better than xz.
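
One way to read the table is to ask which entries are Pareto-optimal on compressed size and compression time, meaning no other entry is at least as good on both and better on one. The sketch below uses only the sizes and compression times from the table; the function name `pareto_front` is made up for this example, and decompression time is deliberately left out.

```python
# (name, compressed size in bytes, compression time in seconds), from the table
results = [
    ("bzip2 -9",         107_099_379,  21.1),
    ("bzip3 -b511",       73_539_847,  28.9),
    ("xz --extreme -9",   71_010_672, 142.0),
    ("lzip -9",           70_964_413, 173.5),
    ("zstd --ultra -22",  48_288_499, 155.6),
]


def pareto_front(rows):
    """Return the names of rows that no other row beats on both size and time."""
    front = []
    for name, size, secs in rows:
        dominated = any(
            s <= size and t <= secs and (s < size or t < secs)
            for _, s, t in rows
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(results))
# prints: ['bzip2 -9', 'bzip3 -b511', 'xz --extreme -9', 'zstd --ultra -22']
```

On these two axes only lzip -9 drops out, because zstd --ultra -22 is both smaller and faster to compress. Adding decompression time as a third axis would favor zstd further, since its 0.4s is the lowest in the table.
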
That was an impressive result, so I tried it on a huge email inbox.

    uncompressed:    1512662084
    xz --extreme -9:  508431572  12:47
    zstd --ultra -21: 508432560  12:44
(-22 ran out of memory.) So at least for me, zstd was identical to xz almost to the byte and the second.
It does really vary based on the data set.
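
Since results depend so much on the input, a small harness for trying several codecs on your own data may help. This is a minimal sketch under stated assumptions: it uses only Python's standard-library `zlib`, `bz2` and `lzma` modules (`lzma` produces .xz streams) at preset 6, not the command-line tools from this thread, so absolute numbers will differ. zstd is left out because it would need a separate binding such as the third-party `zstandard` package. The names `bench`, `CODECS` and `run`, and the sample input, are hypothetical.

```python
import bz2
import lzma
import time
import zlib


def bench(name, compress, decompress, data):
    """Time one compress/decompress round trip and report size and ratio."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    if unpacked != data:
        raise RuntimeError(f"{name}: round trip failed")
    return {
        "name": name,
        "size": len(packed),
        "perc": 100.0 * len(packed) / len(data),
        "c_time": t1 - t0,
        "d_time": t2 - t1,
    }


# (label, compress function, decompress function)
CODECS = [
    ("zlib level 9", lambda d: zlib.compress(d, 9), zlib.decompress),
    ("bz2 level 9", lambda d: bz2.compress(d, 9), bz2.decompress),
    ("lzma preset 6", lambda d: lzma.compress(d, preset=6), lzma.decompress),
]


def run(data):
    return [bench(name, c, d, data) for name, c, d in CODECS]


# Placeholder input; replace with open("yourfile", "rb").read() for real data.
sample = b"<row id='deadbeef'>0123456789abcdef</row>\n" * 20_000

print(f"{'codec':16s}{'size':>10s}{'perc':>9s}{'c.time':>9s}{'d.time':>9s}")
for r in run(sample):
    print(f"{r['name']:16s}{r['size']:>10d}{r['perc']:>8.2f}%"
          f"{r['c_time']:>8.3f}s{r['d_time']:>8.3f}s")
```

Timings from a single run are noisy, so repeating each measurement and taking the minimum would be a reasonable refinement.
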

If the email data is mostly text with markup (like HTML/XML), you might want to try bzip3 too.

It's also possible that a large part of your email is actually already-compressed binary data (like PDFs and images) possibly encoded in base-64. In that case it's likely that all tools are pretty good at compressing the text and headers, but can do little to compress the attachments, which would explain why the results you get are so close.
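
The base-64 effect is easy to reproduce. Base64 stores 6 bits of payload in every 8-bit character, so if the payload is already compressed, the best a compressor can do on the encoded text is roughly 75% of its size. A minimal sketch, assuming random bytes as a stand-in for a compressed attachment and Python's standard-library `lzma` module as a stand-in for xz; the sample header text and the `ratio` helper are made up.

```python
import base64
import lzma
import os


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(lzma.compress(data, preset=6)) / len(data)


# Made-up, highly repetitive mail text.
text = b"Subject: weekly report\nFrom: someone@example.com\n\nHello,\n" * 5_000

# Random bytes stand in for an already-compressed attachment (JPEG, PDF, zip).
attachment = base64.b64encode(os.urandom(1_000_000))

print(f"plain text:        {ratio(text):.3f}")
print(f"base64 attachment: {ratio(attachment):.3f}")
```

If most of an inbox's bytes are attachments like this, every tool ends up near the same floor, which would fit the near-identical xz and zstd results above.
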

    bzip3 -b511: 580771424  8:51
I suspect your theory about compressed attachments is correct, although bzip3 isn't doing very well compared to the rest.
I got -22 to run:

    zstd --ultra -22: 494517545 14:00
Pretty minor difference.
I guess I misunderstood the man page for that option then.
Yup, you should have just tried different -NN levels and you would have noticed. I gave a talk on zstd a couple of years back, and one of the points was that it was better than xz across the board.