by Rendello 1664 days ago
For anyone looking to write their own Deflate/Gzip/Zlib, I would recommend looking at the little program called `infgen`:

> infgen is a deflate stream disassembler. It will read a gzip, zlib, or raw deflate stream, and output a readable description of the contents.

https://github.com/madler/infgen

4 comments

Wow! This program is really cool, and I'm honestly not surprised it was written by Mark Adler, who co-wrote zlib. That guy is an amazing programmer.

Thanks, this would have saved me a bunch of time and I wish I'd known this before writing this article. I'll add it in as a footnote later on.

For anyone who's curious, here's the output of the infgen program on the gzip file in the article (I remade the file so the timestamp is different):

  $ cat test.out.gz | ./infgen -ddis
  ! infgen 2.6 output
  !
  time 1637813470 ! UTC Thu Nov 25 04:11:10 2021
  name 'test.out
  gzip
  !
  last                    ! 1
  fixed                   ! 01
  literal 'a              ! 10010001
  literal 'a              ! 10010001
  match 6 1               ! 00000 0000100
  literal 10              ! 00111010
  end                     ! 0000000
  ! stats literals 8.0 bits each (24/3)
  ! stats matches 66.7% (1 x 6.0)
  ! stats inout 5:6 (4) 9 0
                          ! 00
  ! stats total inout 5:6 (4) 9
  ! stats total block average 9.0 uncompressed
  ! stats total block average 4.0 symbols
  ! stats total literals 8.0 bits each
  ! stats total matches 66.7% (1 x 6.0)
  !
  crc
  length
> (I remade the file so the timestamp is different)

That's why you should always use "gzip -n". Having the file name and timestamp in the compressed file header was IMO a mistake, and newer single-file compressors (bzip2, xz, zstd) AFAIK don't even have these fields in their native file format.
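
For illustration, here is roughly the same idea from Python instead of the gzip command line. This is a sketch, assuming Python 3.8 or later, where `gzip.compress` accepts `mtime`; it is not what `gzip -n` runs internally, and `gzip_header_fields` is a helper invented for this example. It compresses with a zeroed timestamp and no file name, then reads the fixed part of the header back.

```python
import gzip
import struct

FNAME_FLAG = 0x08  # FLG bit that signals an embedded original file name


def gzip_header_fields(blob):
    """Parse the first 8 bytes of a gzip member (hypothetical helper)."""
    magic, method, flags, mtime = struct.unpack("<HBBI", blob[:8])
    return {
        "magic": magic,                          # 0x8b1f when read little-endian
        "method": method,                        # 8 means deflate
        "has_name": bool(flags & FNAME_FLAG),
        "mtime": mtime,                          # Unix timestamp, 0 means "not set"
    }


payload = b"aaaaaaaa\n"
blob = gzip.compress(payload, mtime=0)  # no file name, timestamp forced to 0
fields = gzip_header_fields(blob)
print(fields)
```

With the timestamp zeroed and no name stored, compressing the same input twice gives byte-identical output.
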

Woah! I didn't even notice it was him. Most of the accepted answers relating to anything Deflate are by him on Stack Overflow, so I shouldn't be surprised.

If you use the < operator, you don’t need to spawn a ‘cat’ process that does nothing but copy the contents of the file around.

Programming a gzip (deflate) decompressor is a really good learning project. I did it last year, ultimately producing a pretty poor but working implementation [0]. I ended up modifying Mark Adler's puff example program to print the intermediate tables it produces, which helped me debug my own implementation. I wish I had known about infgen (also by Adler). The other resource I would recommend, beyond the official specs, is this article by Joshua Davies [1], also mentioned in the original post here.

[0]: https://github.com/davecom/nflate

[1]: https://commandlinefanatic.com/cgi-bin/showarticle.cgi?artic...
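
For anyone starting such a project, one easy early check is the fixed Huffman code table from RFC 1951, section 3.2.6. The sketch below is a Python illustration, not taken from puff or nflate, and `fixed_litlen_code` is a name invented here. It computes the code for a literal/length symbol and reproduces the bit patterns from the infgen listing earlier in this thread. It assumes length 6 maps to symbol 260 and distance 1 to distance code 0, both without extra bits, as the RFC tables state.

```python
def fixed_litlen_code(symbol):
    """Return the fixed Huffman code for a literal/length symbol as a bit string."""
    if 0 <= symbol <= 143:
        return format(0b00110000 + symbol, "08b")           # 8-bit codes
    if 144 <= symbol <= 255:
        return format(0b110010000 + (symbol - 144), "09b")  # 9-bit codes
    if 256 <= symbol <= 279:
        return format(symbol - 256, "07b")                  # 7-bit codes
    if 280 <= symbol <= 287:
        return format(0b11000000 + (symbol - 280), "08b")   # 8-bit codes
    raise ValueError(f"not a literal/length symbol: {symbol}")


code_a = fixed_litlen_code(ord("a"))   # literal 'a
code_nl = fixed_litlen_code(10)        # literal 10
code_len6 = fixed_litlen_code(260)     # match length 6
code_end = fixed_litlen_code(256)      # end of block

# Block header (1 bit "last" + 2 bits block type), then the symbols in stream order.
# Distance codes in a fixed block are always 5 bits.
total_bits = 1 + 2 + 2 * len(code_a) + len(code_len6) + 5 + len(code_nl) + len(code_end)
print(code_a, code_nl, code_len6, code_end, total_bits)
```

The total comes to 46 bits, which appears to line up with the `5:6` in the `stats inout` line if that is read as 5 bytes and 6 bits.
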

This tool served as a base for a fun project of mine [0], which consisted of injecting arbitrary bytes into a deflate stream without affecting the decompression output.

The main idea is that a stream is composed of blocks, and dynamic blocks can contain Huffman tables followed by an end-of-block symbol. Since there are no other symbols, such blocks won't produce any output when decompressed. So there are a few bytes in those tables that can be changed (e.g. to put in a little ASCII signature). However, this affects the codes present in the tables, which must pass some validations (so that decompressors can unambiguously match parsed bits to codes). I used an SMT solver to figure out values for the remaining bytes, so that the tables had valid code counts.

[0]: https://nevesnunes.github.io/blog/2021/09/29/Decompression-M...
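
The "blocks that produce no output" part can be demonstrated with a much simpler block type than the one described above. This Python sketch swaps in empty stored blocks instead of dynamic blocks, so it is not the technique from the blog post and involves no Huffman tables or SMT solving. `EMPTY_STORED_BLOCK` and `prepend_empty_blocks` are names invented here. Prepending works because the start of a raw deflate stream is byte-aligned and each stored block ends on a byte boundary.

```python
import zlib

# Non-final stored block with zero data bytes:
# one byte holding the 3 header bits plus padding, then LEN=0x0000 and NLEN=0xFFFF.
EMPTY_STORED_BLOCK = b"\x00\x00\x00\xff\xff"


def raw_deflate(data):
    """Compress to a raw deflate stream (no zlib or gzip wrapper)."""
    comp = zlib.compressobj(level=9, wbits=-15)
    return comp.compress(data) + comp.flush()


def prepend_empty_blocks(stream, count):
    """Put `count` output-free blocks in front of an existing raw deflate stream."""
    return EMPTY_STORED_BLOCK * count + stream


original = b"aaaaaaaa\n"
stream = raw_deflate(original)
padded = prepend_empty_blocks(stream, 3)
restored = zlib.decompress(padded, wbits=-15)
print(len(stream), len(padded), restored)
```

The padded stream is 15 bytes longer but decompresses to the same data. Empty stored blocks have no free bytes to carry a signature, which is presumably why the dynamic-block approach is the interesting one.
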

Oh, that's really nice.

I wish there were a widely accepted standard for describing arbitrary binary formats so one could inspect/debug them, or even use the description as a library to pack/unpack the data. I guess the 010 Editor comes close with its binary templates.