| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jameshart 767 days ago

There’s a lot of handwaving about message sizes which seems fishy to me.

The ‘hidehohedihe’ worked example, the post claims:

> Thus encoding is complete, the final value to store = 311041. This requires 3 bytes / 19 bits to store, while the original message was 8 bytes / 96 bits.

I guess he means 12 bytes here, which seems to be assuming your message is composed of symbols from the full 256 character alphabet.

Given that the decoder needs the exact frequency count of the symbols in the source message (not just a frequency distribution - the original message frequency counts) that means he needs to communicate a series of arbitrary sized integers (sounds like we’ll need ‘universal coding’ - https://en.m.wikipedia.org/wiki/Universal_code_(data_compres... ), the 8 bit characters they correspond to, and the ‘compressed message’ number.

We can maybe terminate the integer list with a zero since there won’t be any entries in the list with a frequency count of 0, and that will automatically tell us how many characters to expect.

So I think that amounts to needing to send: 1,1,2,4,4,0,i,o,d,e,h,311041.

Using Levenshtein coding for the integer list, we need 25 bits for the numbers, 40 for the letters, and 19 bits for the ‘message’, assuming we can detect the end of it somehow. That’s 84 bits in total.

You could literally squeeze ‘hidehohedihi’ into that by just using 7bit ASCII.

Alternatively if I just naively send the n characters in my alphabet as a nul-terminated string (iodeh\0) followed by the string encoded in base(n) - base 5 in this case - we’re looking at what, 48 bits for the character map and 28 bits for the message, for a 76bit message.

With Huffman coding you can omit the lookup table from the compressed message size in many cases because you’re using a common coding table across many messages/files.

In this case the frequency count table is message-specific, though, so you can’t assume the decoder already has access to it.

I suppose if you can maybe force the underlying message to contain an exactly even quantity of all symbols - normalize it with padding until the frequency counts are all the same - you might be able to take advantage of the distribution being ‘preknown’ by the recipient?