Hacker News
by shmerl 2363 days ago
Was XZ used in a parallelized fashion? Otherwise the comparison is kind of pointless. Single-threaded XZ decompression is way too slow.
4 comments

A little-known fact is that parallel XZ compresses worse than XZ! I measured pixz as approximately 2% worse than xz. That's because the input is split into independent chunks.

In comparison, the 0.8% of zstd looks like a bargain.
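
The chunking penalty mentioned above can be reproduced in miniature with Python's standard `lzma` module. This is only a sketch under assumptions: the sample data is synthetic, the 64 KiB chunk size is arbitrary, and independent chunks are modelled as separate concatenated .xz streams, which is a simplification and not necessarily how pixz lays out its output. The size gap it prints should not be read as a measurement of pixz.

```python
import lzma
import random


def make_sample(n_words: int = 120_000, seed: int = 0) -> bytes:
    """Build synthetic text from a small random vocabulary (a stand-in for real package data)."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = ["".join(rng.choices(letters, k=rng.randint(3, 10))) for _ in range(500)]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode("ascii")


def compress_whole(data: bytes, preset: int = 6) -> bytes:
    """Compress the whole input as one stream, so the full history is available."""
    return lzma.compress(data, preset=preset)


def compress_chunked(data: bytes, chunk_size: int, preset: int = 6) -> bytes:
    """Compress each chunk independently; no chunk can reference data from another."""
    parts = [
        lzma.compress(data[i:i + chunk_size], preset=preset)
        for i in range(0, len(data), chunk_size)
    ]
    return b"".join(parts)


if __name__ == "__main__":
    data = make_sample()
    whole = compress_whole(data)
    chunked = compress_chunked(data, 64 * 1024)  # chunk size is an arbitrary assumption
    overhead = 100.0 * (len(chunked) - len(whole)) / len(whole)
    print(f"whole: {len(whole)} bytes, chunked: {len(chunked)} bytes, overhead: {overhead:.1f}%")
```

Each chunk has to relearn the data's statistics from scratch and carries its own stream header and footer, which is where the extra bytes come from. Larger chunks shrink the penalty but leave fewer pieces to hand out to threads.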

Is 0.8% with maximum compression? It's surprising the difference is so small.
0.8% is with Arch's default settings. It's fairly strong, but not the strongest, to save CPU time during compression.

zstd is used at level 20, but it can compress more. Levels can go up to 22, and (complex) advanced commands are available to compress data even more.

Multithreaded xz is non-deterministic and so it's not a candidate.
How is it non-deterministic? Works pretty consistently for me with pixz.
The bytes of the compressed file are non-deterministic and depend on the number of cores used, system load and other “random” factors.
Can't you set those parameters during compression to something fixed? Should be doable.
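
As a rough illustration of what "fixing the parameters" could look like, here is a sketch using Python's standard `lzma` module. It says nothing about how xz or pixz behave internally. The assumption is simply that chunk boundaries are derived from a fixed constant (the hypothetical `CHUNK_SIZE`) rather than from the core count or scheduling, and that results are joined in input order.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fixed chunk size: boundaries depend only on the input, not on the machine.
CHUNK_SIZE = 256 * 1024


def _compress_chunk(chunk: bytes) -> bytes:
    """Compress one chunk with fixed settings, so the same bytes in give the same bytes out."""
    return lzma.compress(chunk, preset=6)


def compress_fixed_chunks(data: bytes, workers: int) -> bytes:
    """Compress fixed-size chunks with `workers` threads and join them in input order."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, regardless of which thread finishes first.
        return b"".join(pool.map(_compress_chunk, chunks))


if __name__ == "__main__":
    data = b"reproducible builds need identical bytes\n" * 30_000
    same = compress_fixed_chunks(data, workers=1) == compress_fixed_chunks(data, workers=4)
    print("identical output for 1 and 4 workers:", same)
```

In this sketch the worker count only affects how the work is scheduled, never where the chunk boundaries fall or in which order the pieces are written.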
Or you can just use zstd.

The xz tool is not deterministic when compressing. The packaging team might change upstream for a few things, but diving into the innards of a compression tool is expecting a bit much.

We are talking about decompression speed and not compression. Decompression is necessarily deterministic.
The compression speed is also an issue for developers. In many cases the compression step takes longer than the rest of the build.
Maybe the point is that the compressed package can change every time, which is an issue for the reproducible-builds idea many distros are now using. Though I'm not sure why parallelized xz can't behave in a predictable fashion.
No, I mean you don’t need to compress in parallel. Compression speed doesn’t matter, and the output is compatible with single- or multi-threaded decompression.
Compression speed can matter in general (to improve build times).

For xz, you need to compress with chunking (and maybe indexing for more benefit) in order to allow parallel decompression to begin with. Otherwise xz produces a blob which you can't split into independent parts during decompression, which makes using many decompression threads pointless.

But yes, if parallel compression is creating non-determinism, you can do all the compression work with chunking but without parallelism, still allowing parallel decompression. But I'm not sure why it even has to create non-determinism in the first place.
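
A minimal sketch of that idea, using Python's standard `lzma` module: compress the chunks one after another in a single thread, remember where each compressed chunk starts, then decompress the chunks in parallel. The side index here is a plain list of `(offset, length)` pairs invented for the example; it is not the xz container's own index format, and the function names are hypothetical.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def compress_indexed(data: bytes, chunk_size: int) -> tuple[bytes, list[tuple[int, int]]]:
    """Serially compress independent chunks and record (offset, length) of each one."""
    blob = bytearray()
    index: list[tuple[int, int]] = []
    for i in range(0, len(data), chunk_size):
        part = lzma.compress(data[i:i + chunk_size])
        index.append((len(blob), len(part)))
        blob += part
    return bytes(blob), index


def decompress_parallel(blob: bytes, index: list[tuple[int, int]], workers: int = 4) -> bytes:
    """Decompress every indexed chunk on a thread pool and join the results in order."""

    def decompress_one(entry: tuple[int, int]) -> bytes:
        offset, length = entry
        return lzma.decompress(blob[offset:offset + length])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(decompress_one, index))


if __name__ == "__main__":
    data = bytes(range(256)) * 4096  # 1 MiB of sample data
    blob, index = compress_indexed(data, 128 * 1024)
    print("chunks:", len(index), "round trip ok:", decompress_parallel(blob, index) == data)
```

Because the chunks are complete streams written back to back, a plain single-threaded `lzma.decompress(blob)` still reads the whole thing. The index only adds the option of starting at any chunk boundary.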

I give thanks every day for pxz. I can churn out apt indices so much faster relative to the alternative.
For general purposes, I like using pixz, which is indexable by comparison: https://github.com/vasi/pixz

Do you know if Debian is using parallelized XZ or not with apt / dpkg?

Maybe worth mentioning that zstd is happy to work in parallel.