Hacker News new | ask | show | jobs
by gliptic 720 days ago
I don't see why arithmetic coding wouldn't match this if the probabilities are suitably updated after every symbol. If you encode 8 bits, 4:4 ratio, after the first bit is encoded, either 1 or 0 has a remaining count of 3, etc. And when the count of either 0 or 1 reaches 0, the rest of the bits are known.
1 comments

Here's a simpler example, 2 symbols 1:1 ratio, Shannon would say the entropy is 1 bit per symbol, so this needs 2 bits: 01 or 10 encode both permutations.

However I can also just store 1 or 0 to indicate what's stored in the first position, using only a single bit, and the next value is inferred.

With arithmetic coding this requires 1 bit to store: the value of the first bit. The other is known with 100% probability. This is very simple to implement, especially with bit-wise arithmetic coding.

The problem is that the Shannon limit doesn't apply to your example. A fixed 1:1 ratio bit string is not I.I.D.

This. Optimal use of an arithmetic encoder is with P(symbol | all previous symbols), not just P(symbol). Obviously there are practical problems with storing all possible contingency tables, so for something like HEVC there is quite a lot of effort in identifying where this matters.
You are saying that conditional on some additional information (or structure) you can get better compression. But that is saying something tautological: conditional on the entropy being lower the entropy is lower. In your case you are assuming you already know the symbol frequencies and you don’t count that as information. However you are comparing against a model where that is not assumed.
That's just regular data compression.

If i had a text that was 100% 'aaaa' or 'bbbb' or 'cccc' with equal probability I'd feed 1/3rd probability aaaa, 1/3rd probability bbbb, 1/3rd probability cccc into an encoder. In this case since the probabilities are not binary numbers i'd use an arithmetic encoder to optimally compress this down to ~1.58 bits per symbol. So 'aaaa' would take 1.58bits to store as would 'bbbb' and 'cccc'

Wouldn’t two symbols with a 1:1 ratio have four possible bit patterns for two bits? (00,01,10,11) With the ratio only happening on average over a large number of bits?

Id there really could only be 01 or 10, then those are the two symbols in the alphabet, and you only need one bit to pick the next symbol (two bits of output).

Those two values aren’t independent at all.