by fwip 269 days ago

Using this example paragraph, at compression level 1 or higher (copy with the quotation symbols):

“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.”
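To reproduce the setup, here is a minimal sketch using Python's standard `zlib` module (that choice is an assumption; the visualizer may well use a different implementation and produce slightly different bytes). It compresses the paragraph at level 0 and at level 1, which shows why level 1 or higher matters: level 0 only stores the bytes as-is.

```python
import zlib

# The example paragraph, including the curly quotes (U+201C and U+201D).
paragraph = (
    "\u201cIt was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness, "
    "it was the epoch of belief, it was the epoch of incredulity, "
    "it was the season of light, it was the season of darkness, "
    "it was the spring of hope, it was the winter of despair.\u201d"
)
raw = paragraph.encode("utf-8")

stored = zlib.compress(raw, 0)    # level 0: bytes are stored without compression
deflated = zlib.compress(raw, 1)  # level 1: fastest level that really compresses

print(len(raw), len(stored), len(deflated))
print(deflated.hex())             # the bytes a visualizer would be colouring
```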

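The red bit at the beginning is zlib header information and parameters. This basically tells the decoder the format of the data coming up: the compression method, the window size, and a few flags. (It doesn't record how long the data is; the decoder finds the end from the end-of-block marker.)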
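As a rough sketch of what those header bits contain, the two zlib header bytes can be picked apart by hand. This follows the field layout in RFC 1950; the sample input and variable names are just for illustration, and the exact boundaries of the red section on the site are not something this code knows about.

```python
import zlib

data = zlib.compress(b"it was the best of times, it was the worst of times", 6)

cmf, flg = data[0], data[1]          # the two zlib header bytes (RFC 1950)
method = cmf & 0x0F                  # 8 means Deflate
window_bits = (cmf >> 4) + 8         # log2 of the LZ77 window size
has_preset_dict = bool(flg & 0x20)   # FDICT flag
level_hint = flg >> 6                # FLEVEL: 0 (fastest) to 3 (best), informational
header_ok = ((cmf << 8) | flg) % 31 == 0  # FCHECK makes the pair a multiple of 31

print(f"method={method} window=2**{window_bits} dict={has_preset_dict} "
      f"level_hint={level_hint} header_ok={header_ok}")
```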

The following grey section is the Huffman coding tables - more common characters in the input are encoded in fewer bits. This is what later tells the decoder that 000 means 'e' and 1110110 means 'I'.
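To get a feel for how frequency turns into code length, here is a textbook Huffman construction in Python. It is only a sketch: Deflate actually transmits code lengths and rebuilds canonical codes from them (capped at 15 bits), so the exact bit patterns zlib picks will differ from anything this produces. The function name and sample text are made up for the example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return {byte value: code length in bits} for a plain Huffman code."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tiebreak, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)   # the two lightest subtrees...
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # ...each sink one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

text = b"it was the best of times, it was the worst of times"
lengths = huffman_code_lengths(text)
counts = Counter(text)
total_bits = sum(counts[s] * n for s, n in lengths.items())

for sym, n in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
    print(repr(chr(sym)), counts[sym], "occurrences ->", n, "bits")
print(total_bits, "bits with Huffman vs", 8 * len(text), "bits raw")
```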

Getting into the content now - this is where the decoder can start emitting the uncompressed text. The first 3 purple characters are the three UTF-8 bytes of the fancy opening quote (Deflate works on bytes, not characters) - because they're rare in this text, they're each encoded as 6 or 7 bits. Because they take a lot of bits, this website shows them in purple and draws them physically wider. The nearby 't' is encoded in 4 bits, 0110, and is shown in a bluer color.
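A quick way to see why the opening quote shows up as three separate symbols, assuming the text is UTF-8 encoded before compression:

```python
quote = "\u201c"                       # the curly opening quote
encoded = quote.encode("utf-8")

print(len(encoded), encoded.hex(" "))  # 3 e2 80 9c
print(len("t".encode("utf-8")))        # 1
```

Each of those three bytes is a separate literal as far as the compressor is concerned, and each gets its own Huffman code.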

The orange bits you've mentioned are back references - "x10 <- 26" here means "go back 26 characters in what you've decoded, and then copy 10 characters again." In this way, we can represent "t was the " in only 12 bits, because we've seen it previously.
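Here is a small sketch of what the decoder does with a back reference like "x10 <- 26". The function name is hypothetical and the leading curly quote is left out of the buffer for brevity (it would not change the distance). Copying one byte at a time matters, because Deflate allows the length to be larger than the distance, in which case the copy reads bytes it has only just written.

```python
def copy_back_reference(out: bytearray, length: int, distance: int) -> None:
    """Append `length` bytes, starting `distance` bytes back from the end of `out`."""
    if not 0 < distance <= len(out):
        raise ValueError("distance points before the start of the output")
    start = len(out) - distance
    for i in range(length):
        out.append(out[start + i])   # may read bytes appended earlier in this loop

# Everything decoded so far, just before the first back reference.
out = bytearray(b"It was the best of times, i")
copy_back_reference(out, length=10, distance=26)
print(out.decode())                  # It was the best of times, it was the

# Overlapping copy: length 6 with distance 2 repeats the last two bytes.
run = bytearray(b"ab")
copy_back_reference(run, length=6, distance=2)
print(run.decode())                  # abababab
```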

The grey at the end is a special "end of block" marker (the last one in the stream), followed by a red checksum, an Adler-32 of the uncompressed data, which lets decoders make sure nothing was corrupted along the way.
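The checksum part is easy to check by hand. A sketch, again assuming Python's `zlib` module: the last four bytes of a zlib stream are the Adler-32 of the uncompressed data in big-endian order, and flipping a single bit there should make decompression fail.

```python
import zlib

raw = b"it was the best of times, it was the worst of times"
data = zlib.compress(raw)

# Trailer: Adler-32 of the *uncompressed* bytes, stored big-endian.
stored = int.from_bytes(data[-4:], "big")
print(stored == zlib.adler32(raw))

# Flip one bit of the checksum and try to decode.
corrupted = bytearray(data)
corrupted[-1] ^= 0x01
try:
    zlib.decompress(bytes(corrupted))
    detected = False
except zlib.error:
    detected = True
print("corruption detected:", detected)
```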

I think that's everything. Further reading: https://en.wikipedia.org/wiki/Zlib https://en.wikipedia.org/wiki/Deflate https://en.wikipedia.org/wiki/Huffman_coding

1 comments

Twirrim 268 days ago

Thank you! I appreciate the explanation

fwip 268 days ago

Happy to help :) I think compression algorithms are super cool, and zlib is a nice example of how just two simple techniques (Huffman coding and dictionary compression) can combine to usefully compress nearly any real-world data.

Newer compression algorithms like zstd, brotli and lz4 basically just use these same methods in different ways. (There are also slightly newer alternatives to Huffman coding, like Asymmetric Numeral Systems and arithmetic coding, but fundamentally they're the same concept.)
