Hacker News new | ask | show | jobs
A (hopefully) new compression algorithm that uses binomials (github.com)
29 points by peter-ebert 720 days ago
10 comments

Looking at it quickly, it seems like you just moving entropy to the frequency table.

If the frequency table isn't included in the count, I could just make an "infinite" compressor by storing the frequency and cut one byte off the end. From the frequency table I could then deduce what the last byte should be.

> typical entropy encoders (Huffman, ANS, etc) would require 1 bit per symbol,

No. ANS (and range/arithmetic coding) allows for probabilities to be stored with fractional bits.

ANS requires 1 bit per symbol if the ratio is 1:1, you can confirm this here: https://kedartatwawadi.github.io/post--ANS/
The value of the ratio is information that needs to be transmitted. You are pretending that it’s free
I was using what I seemed to be the standard with other encoders, storing the ratios is roughly the same cost compared to other compressors, a few extra bits for exact counts. But I do see how this isn't i.i.d. so I'm fixing that.
I don't think you understand the basic concepts and you are wasting your time trying to "fix" it. Let me try to explain a bit differently: The Shannon limit depends on assuming a model of the source (the thing producing symbols). For example imagine the output of something that you think is "true" noise like a fair coin flip that produces 0 (for heads) and 1 (for tails). That has entropy of 1 bit and you cannot compress it. Now you learn that it is actually a pseudorandom number generator that produces 0 or 1 in a 1:1 ratio. If you know the program (and seed if required) the source now has 0 entropy (you can run the program to know every bit that will be produced). Same symbol outputs, different models of how they are produced, different theoretical compression limits. As another example if you are compressing text one model would be to assume that words are produced independently of one another. We know that that isn't true (certain words are more likely to follow other words like your next word prediction system on your phone works) but you might still choose to implement a compression system based on that naive model because it's less complex, uses less memory or compression/decompression time etc. If you implemented compression using a less naive model (e.g. what is the probability of the next symbol in all human text given all the symbols I have seen before this one) you could get better compression rates but you wouldn't be breaking the Shannon limit that you calculated using the naive model because the "true" limit is based on the "true" model of whatever the entropy of human produced text is, which is not really calculable (people produce new sentences every day) but you might approximate using something like all human text you can find.

As I alluded to above note that Shannon entropy doesn't take into consideration practical complexities (memory, program size, processing time) that are real considerations when implementing a compression scheme. Most of the time you are going to trade higher entropy for simpler, faster, less resource intensive compression.

To sum up: You cannot beat the Shannon limit of the "true" model of the source but you can beat the Shannon limit of a naive approximation of the source. Those naive models are used because they are more practical to implement.

I appreciate the input. I did not mean to imply I could encode a random stream of symbols/characters, that is absolutely valid. I was approaching this as a compression technique for something like a text file, where the symbol counts are known at compression time.
Not if the counts are updated correctly.
I was initially surprised to find that the size of multinomials is less than the entropy predicted by the Shannon source coding theorem, which is generally accepted as a lower bound. Not sure if this encoder is new, but hopefully others find it interesting. Step by step math here https://github.com/Peter-Ebert/Valli-Encoding/blob/main/the-...
I don't see why arithmetic coding wouldn't match this if the probabilities are suitably updated after every symbol. If you encode 8 bits, 4:4 ratio, after the first bit is encoded, either 1 or 0 has a remaining count of 3, etc. And when the count of either 0 or 1 reaches 0, the rest of the bits are known.
Here's a simpler example, 2 symbols 1:1 ratio, Shannon would say the entropy is 1 bit per symbol, so this needs 2 bits: 01 or 10 encode both permutations.

However I can also just store 1 or 0 to indicate what's stored in the first position, using only a single bit, and the next value is inferred.

With arithmetic coding this requires 1 bit to store: the value of the first bit. The other is known with 100% probability. This is very simple to implement, especially with bit-wise arithmetic coding.

The problem is that the Shannon limit doesn't apply to your example. A fixed 1:1 ratio bit string is not I.I.D.

This. Optimal use of an arithmetic encoder is with P(symbol | all previous symbols), not just P(symbol). Obviously there are practical problems with storing all possible contingency tables, so for something like HEVC there is quite a lot of effort in identifying where this matters.
You are saying that conditional on some additional information (or structure) you can get better compression. But that is saying something tautological: conditional on the entropy being lower the entropy is lower. In your case you are assuming you already know the symbol frequencies and you don’t count that as information. However you are comparing against a model where that is not assumed.
That's just regular data compression.

If i had a text that was 100% 'aaaa' or 'bbbb' or 'cccc' with equal probability I'd feed 1/3rd probability aaaa, 1/3rd probability bbbb, 1/3rd probability cccc into an encoder. In this case since the probabilities are not binary numbers i'd use an arithmetic encoder to optimally compress this down to ~1.58 bits per symbol. So 'aaaa' would take 1.58bits to store as would 'bbbb' and 'cccc'

Wouldn’t two symbols with a 1:1 ratio have four possible bit patterns for two bits? (00,01,10,11) With the ratio only happening on average over a large number of bits?

Id there really could only be 01 or 10, then those are the two symbols in the alphabet, and you only need one bit to pick the next symbol (two bits of output).

Those two values aren’t independent at all.
The author has built the following encoder:

- A genie hands the encoder the type of an iid source sequence (the function N here: [1])

- The encoder produces a representation of the source with fewer bits than what would be possible using a Shannon-efficient encoder that didn't know the type.

- You also need to know the type to decode.

The reduction in bits from Shannon's theorem is explained by the genie revealing this extra information that reduced the input sequence's entropy (also causing it to become non-iid). The collection of letter-typical sequences of specific type is indeed often going to be smaller than (entropy-rate)^(blocklength), and that is what we are seeing here. This is where the author's explanation starts to go astray:

> The Shannon limit uses Stirling's approximation which is an asymptotic approximation for factorials and as such the approximation is too large for small values and becomes more accurate as the factorial size (i.e. message length and symbol count) approaches infinity.

[1]: https://en.wikipedia.org/wiki/Typical_set#Strongly_typical_s...

I would recommend the author read any information theory textbook, starting at the latest, at the part that covers Asymptotic Equipartition Properties. This is the crucial definition that will make the source coding theorem concrete, in my opinion.

thanks for the constructive feedback!
I think I'm misunderstanding what this does, since a naive reading suggests it's doing something which is pretty easy to prove impossible.

EDIT: Oh. I guess OP isn't including the symbol counts as part of the message when calculating the message length? I guess I can see why this could be useful, but I think this should be made much more explicit.

There’s a lot of handwaving about message sizes which seems fishy to me.

The ‘hidehohedihe’ worked example, the post claims:

> Thus encoding is complete, the final value to store = 311041. This requires 3 bytes / 19 bits to store, while the original message was 8 bytes / 96 bits.

I guess he means 12 bytes here, which seems to be assuming your message is composed of symbols from the full 256 character alphabet.

Given that the decoder needs the exact frequency count of the symbols in the source message (not just a frequency distribution - the original message frequency counts) that means he needs to communicate a series of arbitrary sized integers (sounds like we’ll need ‘universal coding’ - https://en.m.wikipedia.org/wiki/Universal_code_(data_compres... ), the 8 bit characters they correspond to, and the ‘compressed message’ number.

We can maybe terminate the integer list with a zero since there won’t be any entries in the list with a frequency count of 0, and that will automatically tell us how many characters to expect.

So I think that amounts to needing to send: 1,1,2,4,4,0,i,o,d,e,h,311041.

Using Levenshtein coding for the integer list, we need 25 bits for the numbers, 40 for the letters, and 19 bits for the ‘message’, assuming we can detect the end of it somehow. That’s 84 bits in total.

You could literally squeeze ‘hidehohedihi’ into that by just using 7bit ASCII.

Alternatively if I just naively send the n characters in my alphabet as a nul-terminated string (iodeh\0) followed by the string encoded in base(n) - base 5 in this case - we’re looking at what, 48 bits for the character map and 28 bits for the message, for a 76bit message.

With Huffman coding you can omit the lookup table from the compressed message size in many cases because you’re using a common coding table across many messages/files.

In this case the frequency count table is message-specific, though, so you can’t assume the decoder already has access to it.

I suppose if you can maybe force the underlying message to contain an exactly even quantity of all symbols - normalize it with padding until the frequency counts are all the same - you might be able to take advantage of the distribution being ‘preknown’ by the recipient?

As far as I can tell the symbol counts are not included when measuring the size of the output, I did say that but there is a lot of text :) Huffman, ANS, arithmetic coding, etc do not include the frequency counts as part of the size afaik, though of course that needs to be stored to decode, if not please link me.
> Huffman, ANS, arithmetic coding, etc do not include the frequency counts as part of the size afaik, though of course that needs to be stored to decode, if not please link me.

They may or they may not. There are schemes doing both. Either way, you cannot compare arithmetic coding given no information to your scheme that is given the exact frequencies of every symbol.

This arithmetic compression example simply stores the 4 byte values of the frequencies (I used 8 bytes) in a flat table https://github.com/nayuki/Reference-arithmetic-coding/blob/m...

For input3 they used 20 additional bytes while I used 12. Though I know they said their implementation isn't perfect.

That isn't updating the counts after every symbol. Why don't you write one that does so you can compare them fairly?
Sure thing! Does this meet your requirements? https://github.com/nayuki/Reference-arithmetic-coding/blob/m...

If not could you link one that does? If I implement my own I expect someone to say I did it wrong.

*I mean input2, sorry long day.
One's default when seeing a claim to do better than the Shannon limit is to guess that the author is a scammer, confused, or both.

My impression in a quick read of this github page is that this person has no background in channel coding or information theory. Maybe not quite a scammer, but it appears this should be ignored.

It's part of learning. Over excited and some guidance would be helpful but it's learning. He's reinvented an octagonal wheel and we're now able to explain to him 'good start, see this round wheel (arithmetic encoding) over here? it's even better!'.
It's easy to see this octogonal wheel [rolling downhill] is faster than this round wheel [on a flat plain]!
Yeah this was a big fear in posting. No I'm not experienced in academia or this would be a paper, I code things for fun. Have you tried looking at the math for multinomials vs Shannon? Or running the code? Please do point out my mistakes in the math.
It is not our responsibility to do that. You should explain how your method differs from well-established methods such as arithmetic coding (see papers by Rissanen). While doing that, you should quickly realize that you’ve got nothing at all.

A basic data compression course would have sufficed too, but as you say, you don’t have any knowledge of the basics of information theory or data compression. You sound like an over-eager yet ignorant grad student who insists he has found flaws in the professor’s lecture material.

>As an example, given the symbol frequencies of 1, 2, and 3 for a total of 6 symbols. Shannon limit [-(1/6)log2(1/6)-(2/6)log2(2/6)-(3/6)log2(3/6)]6 = 8.75 => 9 bits

This isn't how you calculate Shannon limit fwiw. If all symbols are of equal frequency simply take the total number of different symbols and simply log2(different_symbol_count). That's it.

So say you had 60 different permutations. Each of those is a symbol in entropy encoding. Each individual symbol takes log2(60) bits to store using arithmetic encoding. Which is exactly the statement and correct calculation you had for calculating the Shannon limit in the very next line :). As in the Shannon limit for 60 different symbols is absolutely 5.9bits not 9bits

Instead you have some weird calculation here that seems to take individual probabilities and add them back up again to give 9bits. That's very far off and incorrect.

Thanks I'll correct, I didn't know you could adjust the symbols as you go with arithmetic coding, got an example of that?
I think you could just have a different code for each given index and all the symbols seen until that point.

This seems to do something similar: https://en.wikipedia.org/wiki/Context-adaptive_binary_arithm...

fwiw I was using the same math they did here to get 4.755 bits: https://en.wikipedia.org/wiki/Arithmetic_coding#Sources_of_i... [-(1/3)log2(1/3)-(1/3)log2(1/3)-(1/3)log2(1/3)]*3=4.7548
I think you are confused about probabilities and realized frequencies. A probability is a before-the-fact model. A realized frequency is an after the fact observation. If you flip a fair coin the probability of heads or tails is 50% but any given single flip will have a frequency of 1 for either heads or tails and 0 for the other. (What is true is that as you flip the coin a very large number of times the actual ratio you will see will approach 50/50 but that's definitely not true for say 8 tosses. This is called the asymptotic equipartition principle but it isn't really relevant here).
A better way to do this is plain old arithmetic encoding. It also effectively stores a real number of bits for each symbol which is where the savings here is coming from as you stated.

Ie. you'll get an even better result if you limit plain old arithetic coding to 1/70th probablility of each symbol (70 was stated as the non binary aligning example here) and just encode through using arithmetic encoding. Arithmetic encoding is blisteringly fast too.

https://en.wikipedia.org/wiki/Arithmetic_coding

I'd also say it's not really cheating Shannon limits. It's just that a binary channel likes to align on base two boundaries. Arithmetic coding is the way to store optimally without needing to align on the boundaries. With arithmetic coding at worst you waste 7 bits at the end of the file as you align to 8bits to store.

> Given a message with two symbols in a 1:1 ratio, typical entropy encoders (Huffman, ANS, etc) would require 1 bit per symbol, with 0 representing one symbol and 1 representing the other.

If you make use of the same information required here (exact counts) and update the probability model as you go, the result becomes the same for arithmetic coding/ANS (e.g. even with Huffman, after all symbols of one kind are seen, the remaining symbol is encoded in 0 bits).

So if I'm reading this correctly, this works by knowing how often each symbol occurs ahead of time? (Since the table of frequencies is not counted in the compressed size but required for decompression)
No what he's getting at is that he has 70 unique states per element and binary doesn't align well to that. A 100% proven optimal way to fix this is with arithmetic encoding which is actually a lot faster than the above method and will save even more bits.
Seems like it's the old I can encode all of wikipedia into a single word as long as you have all of wikipedia stored in your decoder.
Yes
If the symbol frequency is known beforehand and fixed for all messages, it's not really i.i.d, as each distribution depends on the previous symbols.
Agree.

In the "For a more concrete example" paragraph, since we're assuming that each byte of 8 bits must have 4 zeros and 4 ones, we always know the 8'th bit after only seeing the first 7. Indeed, many times we would know the 7th and 8th bits after seeing the first 6. So this source is not IID (at the bit-by-bit level) and Shannon's result does not apply there.

As other commenters have said, the source could be IID at the byte level. As noted, it would have 8-choose-4 = 70 symbols. And then Shannon's results would apply to the bytes, not the bits.

The article doesn't give enough information to say whether the source they have in mind is or isn't IID at the byte level. But the obvious choice (all 8-choose-4 symbols are equiprobable) does allow us to compute the Shannon entropy, which is of course always:

  H = - E p(i) log(p(i))
where p(i) is the probability of the i'th symbol, which is always just

  p(i) = 1/(8 choose 4) = 1/70
Of course, then

  H = log(70)
So unsurprisingly the Shannon entropy gives the right answer.

*

Cover and Thomas is really good and intuitive on this. See section 3.2 of:

https://cs-114.org/wp-content/uploads/2015/01/Elements_of_In...

The thing that's wild is that, in the asymptotic case, "everything is in the typical set". The OP is kind of riffing on this, taking it very literally!

Whenever you're compressing a file on your computer the frequencies are known beforehand. Agree this does not apply to noisy-channel coding, only the Shannon source coding theorem in lossless compression. Good point though I'm not sure if this would be considered an entropy encoder, as previous values do have some impact. Any more info I should look at for that?
> Whenever you're compressing a file on your computer the frequencies are known beforehand.

Not at all. The best compressors learn to predict the frequencies as they go, they do not compute them up front and store them in the output.

You still have to store that info and that counts against the number of bytes you are using
Got it, I was thinking of the locations as being i.i.d. in that you do not have any information on their location. You and others are saying that adjusting the symbol counts as you go would give the same performance to other algorithms. Looking into that, thanks.
This can be seen as encoding ranges that span each character. E.g., "asdoasd" needs to encode just 3 for 'o', remove 'o' from "asdoasd" and encode 0 and 2 for 'a', then 0 and 1 for 's' and then only "dd" remains - encode 'd', encode stoppage bit and you are done.

So I implemented that transformed idea in Haskell using rude approximation of some encoding of the integers and got, approximately, 5432177 bits for first 1e6 bytes of enwik9. This translates to 5.43 bits per byte, which is not that bad - order0 encoding gets about 5.56 bits per byte there. My approach can be improved, of course, and improved result can be even smaller.

So, in my not important opinion, it is a good approach. The only drawback I see is that I cannot fathom how to extend that encoder to higher order contexts.

PS

My variant adds counts of runs (count of character appearance), so it does add frequency table implicitly.