|
|
|
|
|
by sebzim4500
720 days ago
|
|
I think I'm misunderstanding what this does, since a naive reading suggests it's doing something which is pretty easy to prove impossible. EDIT: Oh. I guess OP isn't including the symbol counts as part of the message when calculating the message length? I guess I can see why this could be useful, but I think this should be made much more explicit. |
|
The ‘hidehohedihe’ worked example, the post claims:
> Thus encoding is complete, the final value to store = 311041. This requires 3 bytes / 19 bits to store, while the original message was 8 bytes / 96 bits.
I guess he means 12 bytes here, which seems to be assuming your message is composed of symbols from the full 256 character alphabet.
Given that the decoder needs the exact frequency count of the symbols in the source message (not just a frequency distribution - the original message frequency counts) that means he needs to communicate a series of arbitrary sized integers (sounds like we’ll need ‘universal coding’ - https://en.m.wikipedia.org/wiki/Universal_code_(data_compres... ), the 8 bit characters they correspond to, and the ‘compressed message’ number.
We can maybe terminate the integer list with a zero since there won’t be any entries in the list with a frequency count of 0, and that will automatically tell us how many characters to expect.
So I think that amounts to needing to send: 1,1,2,4,4,0,i,o,d,e,h,311041.
Using Levenshtein coding for the integer list, we need 25 bits for the numbers, 40 for the letters, and 19 bits for the ‘message’, assuming we can detect the end of it somehow. That’s 84 bits in total.
You could literally squeeze ‘hidehohedihi’ into that by just using 7bit ASCII.
Alternatively if I just naively send the n characters in my alphabet as a nul-terminated string (iodeh\0) followed by the string encoded in base(n) - base 5 in this case - we’re looking at what, 48 bits for the character map and 28 bits for the message, for a 76bit message.
With Huffman coding you can omit the lookup table from the compressed message size in many cases because you’re using a common coding table across many messages/files.
In this case the frequency count table is message-specific, though, so you can’t assume the decoder already has access to it.
I suppose if you can maybe force the underlying message to contain an exactly even quantity of all symbols - normalize it with padding until the frequency counts are all the same - you might be able to take advantage of the distribution being ‘preknown’ by the recipient?