Hacker News new | ask | show | jobs
by 0xdky 1919 days ago
FWIK, compression works by de-duplication. Finding duplicate patterns is limited to a window.

If similar patterns are close to each other, there is a higher probability of finding such duplicates in the window leading to better compression.

When the files are not sorted, this randomly distributed files with similar patterns beyond the compression window leading to poor compression.

If there is an option to increase the size of window, that would be a good experiment.

This is very similar to `git repack` window and depth parameters. Larger the window and depth, you get better compressed packs.

Wonder if a sort based on diffs (group similar files together) would help get best compression. The cost of such sorting might outweigh the benefits.

1 comments

That’s not the case here. What you describe is basically compression in “solid blocks”. `xz`, however, is (in practice virtually always) fully solid. It allows greater compression efficiency at the expense of random access, which is not possible at all.

The order matters because it “defines” what the algorithm’s dictionary will look like.

I think it's right regarding the limited horizon of the window. Any dynamic compression scheme that evolves knowledge of the input stream will have this property. The tar block being compressed here is many orders of magnitude larger than the window, so the order in which the compressor sees the stream matters.

It is somewhat more complicated because e.g. dynamic Huffan coding rarely applied to raw streams but to a series of one or more encoders such as run-length, but the reasoning remains that if the window or coded/dynamic dictionary size is much larger than the data stream, compression can improve with input permutation.

Back in the 90s I recall an unpopular but very efficient encoder called UltraCompressor[1] which had some input permutation heruristics, "solid" compression, and damage recovery to boot!

[1] http://www.compression.ru/arctest/self/uc.htm