Hacker News new | ask | show | jobs
by joseraul 3573 days ago
Indeed ANS is difficult: it is the biggest innovation in compression in the last 20 years. Its author has some nice but dense slides about it.

https://dl.dropboxusercontent.com/u/12405967/ANSsem.pdf

Not sure exactly when repcodes were invented. Igor Pavlov has already used them in 7zip.

1 comments

The best TL;DR of ANS is something like this (without being too wrong). It's still too long:

Huffman requires at least one bit to represent any symbol, because it finds unique prefix codes for every symbol by varying the leading bits.

Arithmetic coding encodes symbols as fractional numbers of bits, by using binary fractions. It divides up the range to make this work. In the end, you get one fraction per "message" that can be decoded back into the message (IE a fraction like 0.53817213781271231 ....)

range coding is similar, it just uses integers instead of floating point. You get one number that can be decoded back into the message (IE a number like 12312381219129123123).

Note that in both of these, you have not changed the number system at all. The number of even numbers and odd numbers still has the same density.

Another way to look at the above is based on what a bit selects.

In huffman, a single bit is generally not enough to select something, the prefixes are long. So you need to walk a bunch of bits to figure out what you've got.

In arithmetic or range coding, a single bit selects a very large range. They change the proportions of those ranges, but at some point, something has to describe that range. This is because it's a range. Knowing you have gotten to the 8 in 0.538 doesn't tell you anything on it's own, you need to know "the subrange for symbol a is 0.530 ... 0.539", so it's a. So you have to transmit that range.

ANS is a different trick. Instead of encoding things using the existing number system, it changes the number system. That is, it redefines the number system so that our even and odd numbers are still uniformly distributed but have different densities. If you do this in the right way, you end up with a number system that lets you only require one number to determine the state, instead of two (like in a range).

With regular arithmetic coding and Huffman coding you don't need to send over a dictionary. Instead, you can have an adaptive model that learns to compress the data as it goes (e.g. keeps running track of symbol frequencies), and it will still be reversible.

I thought this wasn't possible with ANS. Or has this changed?

It depends if you have static or adaptive coder.

Static is much cheaper, uses the same probabilities for the entire data block (e.g. 30 kB), probabilities are stored in the header - practically all Huffman and tANS compressors (however, there are considered exceptions: https://en.wikipedia.org/wiki/Adaptive_Huffman_coding ).

Adaptive can start with e.g. uniform probability (no need to store in header) and learns on the way - it is more costly but gives better compression, used with arithmetic coding or rANS. See https://fgiesen.wordpress.com/2015/05/26/models-for-adaptive... https://fgiesen.wordpress.com/2015/12/21/rans-in-practice/

I know that's what adaptive coding means, I thought that was impossible with rANS. All the descriptions of rANS I could find (back when I looked at it) described it terms of a static model, and I ran into problems trying to generalize it to an adaptive one.

Thanks for the resources.