Hacker News new | ask | show | jobs
by wpietri 1370 days ago
Having read through the (very good!) Wikipedia explanations a few times, I'm left with a different troubling intuition: if this works to improve compression, then I feel like there must be something wrong with the compression algorithms!
2 comments

General-purpose compression algorithms tend to be conceptually simple. They must be fast, so they can't try many different approaches to see what works best. And because they are general-purpose, they can't assume much about the data.

The typical data compression algorithm starts with a combinatorial transformation of the data. Then it creates a statistical model of the transformed data and encodes it according to the model. There are two successful families of combinatorial transformations: Lempel–Ziv and Burrows–Wheeler. Neither of them is superior to the other, but the compression performance depends on the properties of the data.

Lempel–Ziv parsing replaces copies of repeated substrings with references to an earlier copy. It works best when there are long repeated substrings in the data. The modeling/encoding parts of an LZ compressor are usually pretty simple.

Burrows-Wheeler transform reorders the symbols according to the context they occur in. It's often but not always followed by run-length encoding. Because the symbols are reordered by context, BWT-based compressors work best when there are many copies of each repeated substring.

Unlike the LZ, the BWT does not compress the data on its own. It relies entirely on statistical compression. Because the symbols are ordered by context, the BWT is an order-n model for every n simultaneously. Hence you get something similar to PPM by simply partitioning the BWT into blocks and encoding each block independently with an order-0 encoder.

Not necessarily. After the transform, there are more runs of characters, but it could be that this comes at the expense of recognizable patters (like common words) in the input. It likely helps simple compression algorithms more than sophisticated ones.