Hacker News new | ask | show | jobs
by leftrightupdown 4342 days ago
so based on this, if someone wanted best compression program they would choose paq?
3 comments

By the nature of things, anything that compresses some input data must necessarily lengthen other input data, since you can't get away from the fact that there are only so many input files that can be represented by the output number of bits. In fact, it will almost certainly lengthen many more of the possible inputs than it shortens.

I once heard someone describe compression programs as 'expansion programs with interesting failure cases', and so, of course, the best compression program to use depends on exactly which failure cases you're interested in.

While true this doesn't seem to be a practical issue. Any uncompressable data can be encoded using only one bit of overhead, where the bit is a flag indicating whether the rest of the data is compressed. In practice, there is a header and a field indicating which compression method to use. You pay for the size of the header. Adding support for another compression method is nearly free as far as space is concerned; one byte can switch between 256 of them. (Time is another matter.)
Depending on how many times you're going to decompress the same data and the bandwidth you'll use to transmit the compressed data, different things will be "best". http://fastcompression.blogspot.com/p/compression-benchmark....
It depends on what sort of data you're compressing.
PAQ (particularly ZPAQ) is pretty good at most things because it selects the model which works best. Variants of PAQ tend to be the Hutter Prize winners (paqh) - PPM derivatives are generally excellent at text, source code, HTML, and things with that kind of word-like symbol distribution (which is why PPMd.H in particular is used by RAR and 7-zip for text compression, although RAR selects it automatically and unfortunately 7-zip doesn't seem to, probably because of the extra RAM overhead for decompression that PPMd.H introduces).

However, PAQ tends to be really* slow, largely because it tries more things. It's highly tunable, but people who aren't compression geeks tend to not want to tune their compression. Presets are available.

That's pretty much why it hasn't caught on - speed. There may be some hybrid approaches that deliver a better compromise between context mixing's effectiveness and dictionary coding's speed and memory usage: I guess you could argue LZMA, bringing a Markov-chain algorithm into the mix, is one such, in a way. Sort of.

I'm also a little antsy about ZPAQ formats containing bytecode descriptions of the decompression algorithm needed, and are, broadly speaking, executable (and in some cases, are). That seems like the kind of thing that may invite security problems if approached without due caution.