Hacker News new | ask | show | jobs
by sebzim4500 720 days ago
I think I'm misunderstanding what this does, since a naive reading suggests it's doing something which is pretty easy to prove impossible.

EDIT: Oh. I guess OP isn't including the symbol counts as part of the message when calculating the message length? I guess I can see why this could be useful, but I think this should be made much more explicit.

2 comments

There’s a lot of handwaving about message sizes which seems fishy to me.

The ‘hidehohedihe’ worked example, the post claims:

> Thus encoding is complete, the final value to store = 311041. This requires 3 bytes / 19 bits to store, while the original message was 8 bytes / 96 bits.

I guess he means 12 bytes here, which seems to be assuming your message is composed of symbols from the full 256 character alphabet.

Given that the decoder needs the exact frequency count of the symbols in the source message (not just a frequency distribution - the original message frequency counts) that means he needs to communicate a series of arbitrary sized integers (sounds like we’ll need ‘universal coding’ - https://en.m.wikipedia.org/wiki/Universal_code_(data_compres... ), the 8 bit characters they correspond to, and the ‘compressed message’ number.

We can maybe terminate the integer list with a zero since there won’t be any entries in the list with a frequency count of 0, and that will automatically tell us how many characters to expect.

So I think that amounts to needing to send: 1,1,2,4,4,0,i,o,d,e,h,311041.

Using Levenshtein coding for the integer list, we need 25 bits for the numbers, 40 for the letters, and 19 bits for the ‘message’, assuming we can detect the end of it somehow. That’s 84 bits in total.

You could literally squeeze ‘hidehohedihi’ into that by just using 7bit ASCII.

Alternatively if I just naively send the n characters in my alphabet as a nul-terminated string (iodeh\0) followed by the string encoded in base(n) - base 5 in this case - we’re looking at what, 48 bits for the character map and 28 bits for the message, for a 76bit message.

With Huffman coding you can omit the lookup table from the compressed message size in many cases because you’re using a common coding table across many messages/files.

In this case the frequency count table is message-specific, though, so you can’t assume the decoder already has access to it.

I suppose if you can maybe force the underlying message to contain an exactly even quantity of all symbols - normalize it with padding until the frequency counts are all the same - you might be able to take advantage of the distribution being ‘preknown’ by the recipient?

As far as I can tell the symbol counts are not included when measuring the size of the output, I did say that but there is a lot of text :) Huffman, ANS, arithmetic coding, etc do not include the frequency counts as part of the size afaik, though of course that needs to be stored to decode, if not please link me.
> Huffman, ANS, arithmetic coding, etc do not include the frequency counts as part of the size afaik, though of course that needs to be stored to decode, if not please link me.

They may or they may not. There are schemes doing both. Either way, you cannot compare arithmetic coding given no information to your scheme that is given the exact frequencies of every symbol.

This arithmetic compression example simply stores the 4 byte values of the frequencies (I used 8 bytes) in a flat table https://github.com/nayuki/Reference-arithmetic-coding/blob/m...

For input3 they used 20 additional bytes while I used 12. Though I know they said their implementation isn't perfect.

That isn't updating the counts after every symbol. Why don't you write one that does so you can compare them fairly?
Sure thing! Does this meet your requirements? https://github.com/nayuki/Reference-arithmetic-coding/blob/m...

If not could you link one that does? If I implement my own I expect someone to say I did it wrong.

Sorry, I (and others) have explained several times what an equivalent arithmetic coding scheme would have to do to make the same assumptions (and have the same limitations) as your scheme.

It would have to assume there's _exactly_ N ones and N zeros in the bit string. Start with those counts:

remaining ones = N;

remaining zeros = N;

It would encode bit for bit. After every zero, it decrements the remaining number of zeros. After every one, it decrements the remaining number of ones. The probability of the next bit being zero or one is derived from the known remaining number of each, P(one) = (remaining ones / (remaining ones + remaining zeros)). When one of the counts reaches zero, the probability of the other becomes 100% and no more information need to be encoded.

Does the linked code do that? No.

I don't have a ready-made example because this is not an assumption about the input that is useful in the majority of cases. The linked example can encode _any sequence_, not just ones that have an equal number of ones and zeros.

If you spent time implementing this scheme you would learn a lot about this field.

*I mean input2, sorry long day.