| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by eesmith 3084 days ago

Shannon's original theory does not result in an optimal encoding. Page 2059 of the linked-to text (the third display page) says:

> But, is that the best we can do? Shannon’s Theorem 3 does not address that question, since it only suggests a suboptimal code. (The optimal code of rate R simply disregards all but the 2^(NR) most probable sequences of length N.) Shannon finds the answer in Theorem 4: as long as we require probability of error strictly less than 1, asymptotically, we cannot encode at rates below the entropy. This statement is commonly known as the strong converse source coding theorem. The converse (or weak converse) source coding theorem asserts that error probability cannot vanish if the compression rate is below the entropy.

Later on the same page:

> The variable-length source code that minimizes average length was obtained by D. Huffman

So what you want is to learn about how Huffman coding works. There are many such examples - it's common in data structures courses and text books. One example is in the classic "Structure and Interpretation of Computer Programs" at https://mitpress.mit.edu/sicp/full-text/sicp/book/node41.htm... . Look around and you'll find plenty more.

There are also many code examples at at https://rosettacode.org/wiki/Huffman_coding , and worked-out examples at StackOverflow, like https://stackoverflow.com/questions/11587044/how-can-i-creat... .

The Wikipedia page at https://en.wikipedia.org/wiki/Huffman_coding seems unhelpful for a novice.

1 comments

yaantc 3084 days ago

There is better than Huffman coding now, see range coding [1] or arithmetic coding [2], as used in Zstandard [3] for example. In a nutshell Huffman coding is optimal under the constraint that one uses an integer number of bits per coded symbol. This constraint is quite natural so was not stated explicitly, but it's not a necessity. Range / arithmetic coding are using a continuous encoding covering a full sequence of symbols. In the end, this enables having a fractional number of bits per coded symbol, and matching the actual probability of a given symbol. This saving leads to a more compact encoding.

The patents have now expired, which is why the technology starts to appear in open source and royalty free implementations like Zstd.

[1] https://en.wikipedia.org/wiki/Range_encoding [2] https://en.wikipedia.org/wiki/Arithmetic_coding [3] https://facebook.github.io/zstd/

eesmith 3084 days ago

Arithmetic coding is also mentioned in the text:

> The optimality of the Huffman code must have seemed at the time to leave little room for further work in fixed-to- variable source coding.[11] That, however, proved not to be the case, because of two major difficulties: 1) the distribution of the source may not be known [12] when the code is designed (Section II-E), and 2) although the Huffman algorithm need not operate symbol by symbol, its complexity grows very rapidly with the length of the source block. ... The second shortcoming is circumvented by the arithmetic coding method of J. Rissanen [60] (generalized in [61] and [62] and popularized in [63]), whose philosophy is related to that of the Shannon–Fano code.

I made two assumption in my comment. 1) s-c-h knows nothing about compression, so would like to start with the easiest code developed along information-theoretical lines, and 2) the author of the IEEE paper knows enough about the topic that I could quote the text directly.

To clarify, the author is https://en.wikipedia.org/wiki/Sergio_Verd%C3%BA and is "the Eugene Higgins Professor of Electrical Engineering at Princeton University, where he teaches and conducts research on Information Theory in the Information Sciences and Systems Group."

I do not have enough knowledge of the field to tell if he included the correct qualifiers. The Huffman code is a "variable-length source code that minimizes average length" while arithmetic codes appear after mentioning the goal of "minimum average length in terms of the distribution of the source."

That appears correct to me.