Hacker News new | ask | show | jobs
by neon5077 819 days ago
Look up Huffman codes on Wikipedia.

In short, you put the most likely or most common values at the front of the dictionary. Values 0-255 take up one byte, 256-65535 take two bytes, etc. The lower your index, the fewer bits are needed to represent the corresponding state.

2 comments

You're missing an important detail of Huffman codes: they need to be prefix-free so you can actually tell when one value ends and another begins.

A simplified example: suppose you have four letters in your alphabet that you want to represent in binary. You can't just represent these as A=0, B=1, C=10, and D=01, since it's impossible to tell whether "01011" is meant to represent ABABB or DDA or ACBB.

(Hamming codes on the other hand are intended for error correction, where you want maximally distant code words to minimize the chance that bitflips cause decoding errors.)

I don't really think that's too relevant to the core concept of using fewer bits for the most common codes.

That's really a problem inherent to binary streams in general, not just Huffman encoding.

It was a minor point re: this claim:

> Values 0-255 take up one byte, 256-65535 take two bytes

If you wanted to encode more than 256 values, then at best you'd be able to specify values 0-254 taking one byte, e.g. if you used 0xff as a special prefix to represent "use another byte for all other values", but that particular encoding means that you'd only be able to encode another 256 values with a second byte.

I think you mean Huffman codes, not Hamming codes.
Yup.