by peterhull90 803 days ago
It's a consequence of being block-based as mentioned elsewhere, but interesting to note that cat'ing together bzip2 files gives a valid bzip2 file. That's the basis of pbzip2 [0] - it breaks the input file into chunks of 900K by default, compresses each chunk and then concatenates the compressed chunks. The individual chunks can be compressed in parallel if hardware allows.

[0]: https://man.freebsd.org/cgi/man.cgi?query=pbzip2&apropos=0&s...
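
A minimal Python sketch of the idea described above: split the input into chunks, compress each chunk as its own bzip2 stream, and concatenate the results. This is not pbzip2's actual code; `parallel_bzip2` and `CHUNK_SIZE` are hypothetical names, and the 900K default is taken from the comment above (interpreted here as 900,000 bytes, which is an assumption). It relies on Python's `bz2.decompress` accepting multi-stream input.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

# Assumed interpretation of the "900K" chunk size mentioned above.
CHUNK_SIZE = 900 * 1000


def parallel_bzip2(data: bytes, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Compress each chunk as an independent bzip2 stream and concatenate them."""
    # Fall back to a single empty chunk so empty input still yields a valid stream.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    # Each chunk is compressed on its own, so the work can be spread across workers.
    with ThreadPoolExecutor() as pool:
        compressed = list(pool.map(bz2.compress, chunks))
    # The concatenation of valid bzip2 streams is itself a valid bzip2 file.
    return b"".join(compressed)
```

The output can be read back with `bz2.decompress(parallel_bzip2(data))`, which walks through the concatenated streams in order.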

1 comment

gzip isn't block-based by default, but it does effectively support a dictionary reset command in the compressed stream. This “command” is essentially the start of the gzip header, so if you cat two pieces of gzipped data together, decompressing the result gives the same output as concatenating the two source data streams. This means you can turn gzip into a block-based process and therefore parallelise it in the same manner as bzip2, and this is how pigz⁰¹ works.
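
To illustrate the concatenation property, here is a small Python sketch. It is only a demonstration and not how pigz itself is written; `gzip_in_blocks` is a hypothetical helper, and the 128 KiB block size is borrowed from the pigz default mentioned in footnote 3. It assumes Python's `gzip.decompress`, which handles multi-member data.

```python
import gzip


def gzip_in_blocks(data: bytes, block_size: int = 128 * 1024) -> bytes:
    """Compress fixed-size blocks as separate gzip members and concatenate them."""
    members = [
        gzip.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]
    # Empty input still needs one (empty) member to be a valid gzip file.
    return b"".join(members) or gzip.compress(b"")


# Two independently compressed pieces, joined byte-for-byte like `cat a.gz b.gz`.
combined = gzip.compress(b"first part, ") + gzip.compress(b"second part")
restored = gzip.decompress(combined)
```

Because each block is compressed independently, the list comprehension could be swapped for a worker pool to compress blocks in parallel.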

This dictionary reset trigger is how the “rsyncable”² option³ is implemented too. Resetting the compression dictionary this way every 1000 input bytes increases the size of the compressed output by surprisingly little⁴.
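
A rough way to measure the cost of frequent resets, sketched in Python. Note the assumptions: this uses zlib's `Z_FULL_FLUSH` at fixed intervals inside a single gzip stream as a stand-in for a dictionary reset, which is not necessarily how gzip's rsyncable option picks its reset points; `deflate_with_resets` and `overhead_percent` are hypothetical names, and the sample data is synthetic.

```python
import random
import zlib
from typing import Optional


def deflate_with_resets(data: bytes, reset_every: Optional[int] = None) -> bytes:
    """Produce a gzip stream, optionally forcing a full flush every N input bytes."""
    comp = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip container
    out = []
    if reset_every is None:
        out.append(comp.compress(data))
    else:
        for i in range(0, len(data), reset_every):
            out.append(comp.compress(data[i:i + reset_every]))
            # Z_FULL_FLUSH discards the compressor's history, so later data
            # cannot refer back to anything before this point.
            out.append(comp.flush(zlib.Z_FULL_FLUSH))
    out.append(comp.flush())  # finish the stream
    return b"".join(out)


def overhead_percent(data: bytes, reset_every: int) -> float:
    """Size increase (in percent) caused by the periodic resets."""
    plain = len(deflate_with_resets(data))
    reset = len(deflate_with_resets(data, reset_every))
    return 100.0 * (reset - plain) / plain


# Synthetic, deterministic sample text (purely illustrative).
rng = random.Random(0)
words = [b"gzip", b"bzip2", b"block", b"stream", b"reset", b"rsync", b"chunk"]
sample = b" ".join(rng.choice(words) for _ in range(50000))
```

Calling `overhead_percent(sample, 1000)` gives the size penalty for this particular input; the figure depends heavily on the data, as footnote 4 notes.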

--

[0] https://zlib.net/pigz/

[1] I actually started making my own version of this, way back when, inspired by looking into how gzip's rsyncable option² worked, before discovering it already existed! I still “finished” my version as far as a working PoC, though, as it was an interesting enough exercise.

[2] https://manpages.debian.org/bookworm/gzip/gzip.1.en.html#rsy...

[3] Also supported by pigz⁰, where it is used within each block it compresses. Because pigz splits the input at regular intervals anyway (instead of using a more dynamic approach), its output is naturally already more rsync-compatible than plain gzip's, though with the default 128KiB block size it is notably less so than with a reset every 1000 input bytes.

[4] Usually between 1% and 3% IIRC, depending on the input content of course; for some inputs the difference could be lower than that range, or much higher.