| HN Mirror

General-purpose compression algorithms tend to be conceptually simple. They must be fast, so they can't try many different approaches to see what works best. And because they are general-purpose, they can't assume much about the data.

The typical data compression algorithm starts with a combinatorial transformation of the data. Then it creates a statistical model of the transformed data and encodes it according to the model. There are two successful families of combinatorial transformations: Lempel–Ziv and Burrows–Wheeler. Neither of them is superior to the other, but the compression performance depends on the properties of the data.

Lempel–Ziv parsing replaces copies of repeated substrings with references to an earlier copy. It works best when there are long repeated substrings in the data. The modeling/encoding parts of an LZ compressor are usually pretty simple.

Burrows-Wheeler transform reorders the symbols according to the context they occur in. It's often but not always followed by run-length encoding. Because the symbols are reordered by context, BWT-based compressors work best when there are many copies of each repeated substring.

Unlike the LZ, the BWT does not compress the data on its own. It relies entirely on statistical compression. Because the symbols are ordered by context, the BWT is an order-n model for every n simultaneously. Hence you get something similar to PPM by simply partitioning the BWT into blocks and encoding each block independently with an order-0 encoder.