| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Rendello 1664 days ago

For anyone looking to write their own Deflate/Gzip/Zlib, I would recommend looking at the little program called `infgen`:

> infgen is a deflate stream disassembler. It will read a gzip, zlib, or raw deflate stream, and output a readable description of the contents.

https://github.com/madler/infgen

4 comments

thomastay 1664 days ago

Wow! This program is really cool and I'm honestly not surprised it's written by Mark Adler, the guy who wrote zlib, that guy is an amazing programmer.

Thanks, this would have saved me a bunch of time and I wish I'd known this before writing this article. I'll add it in as a footnote later on.

For anyone who's curious, here's the output of the infgen program on the gzip file in the article (I remade the file so the timestamp is different)

  $ cat test.out.gz | ./infgen -ddis
  ! infgen 2.6 output
  !
  time 1637813470 ! UTC Thu Nov 25 04:11:10 2021
  name 'test.out
  gzip
  !
  last                    ! 1
  fixed                   ! 01
  literal 'a              ! 10010001
  literal 'a              ! 10010001
  match 6 1               ! 00000 0000100
  literal 10              ! 00111010
  end                     ! 0000000
  ! stats literals 8.0 bits each (24/3)
  ! stats matches 66.7% (1 x 6.0)
  ! stats inout 5:6 (4) 9 0
                          ! 00
  ! stats total inout 5:6 (4) 9
  ! stats total block average 9.0 uncompressed
  ! stats total block average 4.0 symbols
  ! stats total literals 8.0 bits each
  ! stats total matches 66.7% (1 x 6.0)
  !
  crc
  length

link

cesarb 1664 days ago

> (I remade the file so the timestamp is different)

That's why you should always use "gzip -n". Having the file name and timestamp in the compressed file header was IMO a mistake, and newer single-file compressors (bzip2, xz, zstd) AFAIK don't even have these fields in their native file format.

link

Rendello 1663 days ago

Woah! I didn't even notice it was him. Most of the accepted answers relating to anything Deflate are by him on Stack Overflow, so I shouldn't be surprised.

link

tinus_hn 1663 days ago

If you use the < operator you don’t need to spawn a ‘cat’ process that does nothing but copy around the contents of the file.

link

WoodenChair 1664 days ago

Programming a gzip (deflate) decompressor is a really good learning project. I did it last year, ultimately producing a pretty poor but working implementation [0]. I ended up modifying Mark Adler's puff example program to see the intermediary tables it produces to help debug my own implementation. I wish I had known about infgen (also by Adler). The other resource I would recommend, beyond the official specs is this article by Joshua Davies[1], also mentioned in the original post here.

[0]: https://github.com/davecom/nflate [1]: https://commandlinefanatic.com/cgi-bin/showarticle.cgi?artic...

link

0d0a 1664 days ago

This tool served as a base for a fun project of mine [0], which consisted on injecting arbitrary bytes to a deflate stream without affecting the decompression output.

The main idea is that a stream is composed of blocks, and dynamic blocks can contain Huffman tables followed by a end-of-block symbol. Since there's no other symbols, such blocks won't produce any output when decompressed. So there's a few bytes for those tables that can be changed (e.g. to put a little ascii signature). However, this affects the codes present in the tables, which must pass some validations (for decompressors to unambiguously match parsed bits to codes). I used a SMT solver to figure out values for the remaining bytes, so that the tables had valid code counts.

[0]: https://nevesnunes.github.io/blog/2021/09/29/Decompression-M...

link

kuroguro 1664 days ago

Oh, that's really nice.

I wish there was a widely accepted standard to write down arbitrary binary formats so one could inspect/debug them. Or even use it as a library to puck/unpack the data. I guess the 010 editor comes close with it's binary templates.

link