Hacker News new | ask | show | jobs
by joshsabol46 1106 days ago
Please pardon my ignorance...

My understanding is that there are 1000s of different compression algorithms, each with their own pros/cons dependent on the type and characteristics of the file. And yet we still try to pick the "generically best" codec for a given file (ex. PNG) and then use that everywhere.

Why don't we have context-dependent compression instead?

I'm imagining a system that scans objects before compression, selects the optimal algorithm, and then encodes the file. The selected compression algorithm could be prefixed for easy decompression.

Compare a single black image that's 1x1 to one that's 1000x1000. PNGs are 128bytes and 6KB, respectively. However, Run Length Encoding would compress the latter to a comparable size as the former.

6 comments

There are a variety of use cases that dictate which algorithm is going to perform best. For example, you might use Zstandard -19 if you are compressing something once and transferring it over a slow network to millions of people. You might use LZ4 if you are generating a unique large piece of data interactively for thousands of concurrent users, because it compresses faster than Zstandard. Basically, if you're constrained by network bandwidth, Zstandard; if you're constrained by CPU, LZ4.

There are then legacy formats that have stuck around long past their sell-by date, like gzip. People are used to using gzip, so you see it everywhere, but it's slower and compresses worse than Zstandard, so there is no reason why you'd ever use it except for compatibility with legacy systems. (bzip2, 7z, xz, snappy, etc. also live in this "no reason to use in 2023" space.)

Take a look at performance measurements here: https://jolynch.github.io/posts/use_fast_data_algorithms/. For example, gzip can get a compression ratio of 0.41 at 21MiB/s, while Zstandard does 0.38 (better) at 134MiB/s. (Meanwhile, lz4 produces outputs nearly twice as large as Zstandard, but compresses almost 3x faster and decompresses 2.5x faster.)

Lossy compression is even more complicated because the compression algorithms take advantage of "nobody will notice" in a way that's data dependent; so music, video, and photographs all have their own special algorithms.

Your link seems to compare GNU gzip with zstd. When comparing file formats, I would compare the best software for that file format. igzip: https://github.com/intel/isa-l can decompress consistently faster than GNU gzip. Depending on the file, it decompresses 2-3x faster making it almost as fast as zstd decompression. I have less experience with compression benchmarks. A quick benchmark on Silesia shows igzip to be ~7x faster but it sacrifices 10% of compression ratio for that even on its highest compression setting. It seems to be optimized for speed.
There is some approach you can use sort of like this within PNGs themselves I think. It's been a while so I might be misrembering but effectively each "row" of data that is compressed can be encoded as the difference from the preceding row using 4 or 5 different operations.

You can achieve better compression by brute forcing the possible operations used to encode the rows to find the "most compressible" output. Not quite what you meant but sort of similar in that you try multiple approaches and pick the best.

I gave up before implementing it but in the stub I left this comment to myself " A heuristic approach is to use adaptive filtering as follows: independently for each row, apply all five filters and select the filter that produces the smallest sum of absolute values per row.".

In addition more similar to your approach PDFs support many compression filters for objects internally like RLE and ZIP so you can choose the best algorithm per object but generally it's quicker just to ZIP everything.

> There is some approach you can use sort of like this within PNGs themselves I think. [...] You can achieve better compression by brute forcing the possible operations used to encode the rows to find the "most compressible" output.

That's the approach used by tools like optipng and pngcrush.

> I gave up before implementing it but in the stub I left this comment to myself " A heuristic approach is to use adaptive filtering as follows: independently for each row, apply all five filters and select the filter that produces the smallest sum of absolute values per row.".

The same idea can be found in the PNG standard itself: "For best compression of truecolor and grayscale images, we recommend an adaptive filtering approach in which a filter is chosen for each scanline. The following simple heuristic has performed well in early tests: compute the output scanline using all five filters, and select the filter that gives the smallest sum of absolute values of outputs. (Consider the output bytes as signed differences for this test.) This method usually outperforms any single fixed filter choice. However, it is likely that much better heuristics will be found as more experience is gained with PNG." (quoted from the PNG specification, version 1.2)

> selects the optimal algorithm

Here's the catch: how does the "system" know which algorithm would be the best? It could try encoding it with multiple algorithms and see which one is shorter, but that's extra CPU.

And the "system" can be called acompression algorithm itself.

I mean yeah that's basically what high compression solutions like paq have done, depending on the compression level desired apply increasingly speculative and computationally intensive models to the block and pick whichever one worked the best.
And then, when nobody wants to implement all the compression algorithms in a compressor or decompressor, we end up with files out in the wild that only pick one of them anyway.
OpenZFS' integration of Zstandard uses LZ4 as a "compression canary" for higher ZStandard compression levels, where they feed the data blocks through LZ4 and if it compresses it enough, feeds it through Zstandard.

This relies on LZ4 being very fast, especially with it's early-exit on incompressible data.

Overall this turns out to be a win, you lose a little bit of compression at a huge decrease in CPU over just using the same Zstandard compression for all the blocks.

It would still pay off in many situations. With existing algorithms you can already optimize image files to be compressed as much as possible, which takes quite a bit longer than usual but if it means 30% smaller files for an entire website that has an impact on every visitor.
WinRAR used to support specialized compression algorithms for specific kinds of files, like uncompressed bitmaps, sounds, executables, or plain text. However, the feature was removed in the RAR5 file format 10 years ago. Maybe it was not worth it?
I listened to a talk once where they said that this is sort of how AV1 compression works, where there are many possible ways to compress a block, and you can choose the best one at runtime by computing each and seeing which is best.
Most image codecs are made for natural images; those of the real world, not synthetic ones like the one you proposed. Lossy codecs like JPEG use perceptual coding to maximize perceived quality for a given file size.