
The issue talks about one vs. multiple frames. That's exactly the issue. It's not a matter of complexity, it's a matter of bad compromises.

The problem is easy to walk through by hand. The simplest encoding where it shows up is RLE (run-length encoding).

Say we have 1MB of repeated 'a'. Originally 'aaa....a'. We now encode it as '(length,byte)', so the stream turns into (1048576,'a').

Now we want to parallelize it over 16 cores. So we split the 1MB into 16 chunks of 64k each and compress each chunk independently. This works, but the output is ~16x larger: 16 tuples of (65536,'a') instead of a single one.
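To make the RLE example concrete, here is a minimal sketch in Python. The helper names `rle_encode` and `rle_encode_chunked` are made up for illustration, and the "parallel" version just loops over the chunks, since only the output size matters here.

```python
from itertools import groupby

def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Encode bytes as a list of (run_length, byte_value) tuples."""
    return [(sum(1 for _ in run), value) for value, run in groupby(data)]

def rle_encode_chunked(data: bytes, n_chunks: int) -> list[tuple[int, int]]:
    """Encode each chunk on its own, as independent workers would."""
    size = -(-len(data) // n_chunks)  # ceiling division
    out = []
    for start in range(0, len(data), size):
        # A run can never extend across a chunk boundary here.
        out.extend(rle_encode(data[start:start + size]))
    return out

data = b"a" * (1 << 20)                 # 1 MiB of 'a'
whole = rle_encode(data)                # one tuple for the whole input
chunked = rle_encode_chunked(data, 16)  # one tuple per chunk

print(len(whole), len(chunked))
```

The single-stream encoder emits one tuple, the chunked one emits 16, which is where the ~16x comes from.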

Similar things happen for window based algorithms. We encode repeated content as (offset,length), referencing older occurrences. Now imagine 64k of random data, repeated 16 times. The parallel version can't compress anything (16x random data), the non-parallel version will compress it roughly 16:1.
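The same experiment can be sketched with a real window-based compressor. I'm using Python's `lzma` module here as a stand-in (an assumption on my side, not zstd itself), because its default dictionary is large enough to cover the whole 1MB input. The exact sizes will vary from run to run since the block is random.

```python
import lzma
import os

BLOCK = 64 * 1024            # 64k of random data
REPEATS = 16

block = os.urandom(BLOCK)
data = block * REPEATS       # the same random block, 16 times

# Single stream: later copies can reference the first one.
whole_size = len(lzma.compress(data))

# Independent chunks: each chunk is just 64k of random data,
# with nothing earlier to reference.
chunks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
chunked_size = sum(len(lzma.compress(c)) for c in chunks)

print(f"input:   {len(data)} bytes")
print(f"whole:   {whole_size} bytes")
print(f"chunked: {chunked_size} bytes")
```

The single stream should end up close to the size of one block, while the chunked total stays around the input size (or slightly above, due to per-stream overhead).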

There is a trick to avoid this downside. The lookup is not unlimited; there is a maximum window size to limit memory usage. For compatibility it's 8MB for zstd (at level 19), but you can go all the way to 2GB (ultra, 22, long=31). If you make the chunks significantly larger than the window, you only lose out on the ramp-up at the start of each chunk, where the window is still filling. E.g. if you use 80MB chunks then you have a bit less than 10% of the file encoded worse. You could still double your encoded size with a well-crafted file.

If you don't care about parallel decompression, you can parallelize only parts of the work, like the lookup search. This gives a good speedup, but only on compression. That's the current parallel compression approach in most cases (iirc), leading to a single frame, just produced faster. The problem is that back-references can only be resolved backwards, i.e. the decoder needs the earlier output before it can decode anything that refers to it.
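To put rough numbers on the chunk-size trick: the part of each chunk that is encoded with a not-yet-full window is at most one window length. The sketch below is only that back-of-the-envelope bound; `warmup_fraction` is a hypothetical helper, and it says nothing about how much worse that part actually compresses.

```python
MB = 1_000_000

def warmup_fraction(window_bytes: int, chunk_bytes: int) -> float:
    """Upper bound on the share of a chunk encoded with a partially filled window."""
    return min(1.0, window_bytes / chunk_bytes)

window = 8 * MB  # the zstd level 19 window size mentioned above
for chunk_mb in (8, 16, 80, 800):
    share = warmup_fraction(window, chunk_mb * MB)
    print(f"{chunk_mb:4d} MB chunks: up to {share:.0%} affected")
```

So bigger chunks shrink the affected share, at the cost of fewer chunks to hand out to cores.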

The whole problem is not implementation complexity. It's something you algorithmically can't do with current window based approaches without significant tradeoffs on memory consumption, compression ratio and parallel execution.

For bzip2 the file is always split into blocks of at most 900 kB. Each block is encoded independently and can be decoded independently. It avoids this whole tradeoff issue altogether.
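A rough way to play with this idea from Python: the `bz2` module does not expose bzip2's internal blocks, so this sketch emulates them with one independent bz2 stream per 900 kB piece (that substitution is my assumption, not how the bzip2 file format itself is laid out). Each piece is compressed and decompressed without looking at any other piece.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

PIECE = 900_000  # assumption: mirrors bzip2's largest block size

def split(data: bytes, size: int = PIECE) -> list[bytes]:
    """Cut the input into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]

data = bytes(range(256)) * 10_000  # about 2.5 MB of sample input
pieces = split(data)

with ThreadPoolExecutor() as pool:
    # Each piece becomes its own stream; no piece depends on another.
    streams = list(pool.map(bz2.compress, pieces))
    # Decompression is equally independent, so it parallelizes the same way.
    restored = list(pool.map(bz2.decompress, streams))

roundtrip = b"".join(restored)
print(len(pieces), roundtrip == data)
```

Any single stream can be decoded on its own, which is the property that makes parallel decompression straightforward here.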

I would also disagree with "no need". Zstd easily outperforms tar, but even my laptop SSD is faster than the zstd speed limits. I just don't have the _external_ connectivity to get something onto my disk fast enough. I've also worked with servers 10 years ago where the PCIe bus to the RAID card was the limiting factor. Again easily exceeding the speed limits.

Anyway, as mentioned a few times it's an odd corner case. And one can't go wrong by choosing zstd for compression. But it is real fun to dig into these issues and look at them, I hope this sparks some interest in it!