Hacker News new | ask | show | jobs
by userbinator 638 days ago
Besides a persistent off-by-one error, and the use of actual trees instead of table lookup for canonical Huffman, this is a pretty good summary of the LZ+Huffman process in general; and in the 80s through the mid 90s, this combination was widely used for data compression, before the much more resource-intensive adaptive arithmetic schemes started becoming popular. It's worth noting that the specifics of DEFLATE were designed by Phil Katz, of PKZIP fame. Meanwhile, a competing format, the Japanese LZH, was chosen by several large BIOS companies for compressing logos and runtime code.

Note that real-world GZIP decoders (such as the GNU GZIP program) skip this step and opt to create a much more efficient lookup table structure. However, representing the Huffman tree literally as shown in listing 10 makes the subsequent decoding code much easier to understand.

Is it? I found the classic tree-based approach to become much clearer and simpler when expressed as a table lookup --- along with the realisation that the canonical Huffman codes are nothing more than binary numbers.

4 comments

I'd agree with TFA that canonical Huffman, although interesting, would be yet another thing to explain, and better left out of scope, but it does raise a question:

In what other areas (there must be many) do we use trees in principle but sequences in practice?

(eg code: we think of it as a tree, yet we store source as a string and run executables which —at least when statically linked— are also stored as strings)

> In what other areas (there must be many) do we use trees in principle but sequences in practice?

Heapsort comes to mind first.

Author here! I actually did a follow-up where I looked at the table-based decoding: https://commandlinefanatic.com/cgi-bin/showarticle.cgi?artic...
>before the much more resource-intensive adaptive arithmetic schemes started becoming popular

The biggest problem was software-patent stuff nobody wanted to risk before they expired.

The lookup table here would be indexed by a fixed number of lookahead bits, so it for example would duplicate shorter codes and put longer codes into side tables. So a tree structure, either represented as an array or a pointer-chasing structure would be much simpler.