Hacker News new | ask | show | jobs
by yaantc 3084 days ago
There is better than Huffman coding now, see range coding [1] or arithmetic coding [2], as used in Zstandard [3] for example. In a nutshell Huffman coding is optimal under the constraint that one uses an integer number of bits per coded symbol. This constraint is quite natural so was not stated explicitly, but it's not a necessity. Range / arithmetic coding are using a continuous encoding covering a full sequence of symbols. In the end, this enables having a fractional number of bits per coded symbol, and matching the actual probability of a given symbol. This saving leads to a more compact encoding.

The patents have now expired, which is why the technology starts to appear in open source and royalty free implementations like Zstd.

[1] https://en.wikipedia.org/wiki/Range_encoding [2] https://en.wikipedia.org/wiki/Arithmetic_coding [3] https://facebook.github.io/zstd/

1 comments

Arithmetic coding is also mentioned in the text:

> The optimality of the Huffman code must have seemed at the time to leave little room for further work in fixed-to- variable source coding.[11] That, however, proved not to be the case, because of two major difficulties: 1) the distribution of the source may not be known [12] when the code is designed (Section II-E), and 2) although the Huffman algorithm need not operate symbol by symbol, its complexity grows very rapidly with the length of the source block. ... The second shortcoming is circumvented by the arithmetic coding method of J. Rissanen [60] (generalized in [61] and [62] and popularized in [63]), whose philosophy is related to that of the Shannon–Fano code.

I made two assumption in my comment. 1) s-c-h knows nothing about compression, so would like to start with the easiest code developed along information-theoretical lines, and 2) the author of the IEEE paper knows enough about the topic that I could quote the text directly.

To clarify, the author is https://en.wikipedia.org/wiki/Sergio_Verd%C3%BA and is "the Eugene Higgins Professor of Electrical Engineering at Princeton University, where he teaches and conducts research on Information Theory in the Information Sciences and Systems Group."

I do not have enough knowledge of the field to tell if he included the correct qualifiers. The Huffman code is a "variable-length source code that minimizes average length" while arithmetic codes appear after mentioning the goal of "minimum average length in terms of the distribution of the source."

That appears correct to me.