Hacker News
by lucb1e 1166 days ago
Would pigz's parallel compression prevent that failure case, or at least limit the damage to a single block (128K by default)? https://github.com/madler/pigz/blob/master/pigz.1#L43-L45

That would be a nice extra benefit, besides the speedup from being multithreaded. (I assume zstd also does multithreading but for those stuck with gzip, this is a drop-in replacement.)

Edit: bzip2 apparently does the same, "bzip2 compresses files in blocks, usually 900 kbytes long. Each block is handled independently. If a media or transmission error causes a multi-block .bz2 file to become damaged, it may be possible to recover data from the undamaged blocks in the file." (--man bzip2)

2 comments

No, as far as I know, the pigz blocks, not to be confused with deflate blocks, are still compressed by referring to the preceding uncompressed data even if it belongs to another pigz block. Therefore errors would still propagate indefinitely in the worst case.
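To illustrate the dependency, here is a minimal sketch using Python's zlib rather than pigz itself. It assumes a preset dictionary is a fair stand-in for "the preceding uncompressed data"; the names (prev_block, next_block, deflate_raw, inflate_raw, decodes_alone) are made up for the example.

```python
import random
import zlib

rng = random.Random(0)
# An incompressible "previous block"; the next block repeats part of it.
prev_block = bytes(rng.getrandbits(8) for _ in range(4096))
next_block = prev_block[1000:3000]

def deflate_raw(data, zdict=None):
    """Raw DEFLATE, optionally primed with earlier data as a dictionary."""
    if zdict is None:
        comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    else:
        comp = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=zdict)
    return comp.compress(data) + comp.flush()

def inflate_raw(blob, zdict=None):
    """Raw INFLATE, optionally primed with the same dictionary."""
    if zdict is None:
        dec = zlib.decompressobj(-15)
    else:
        dec = zlib.decompressobj(-15, zdict=zdict)
    return dec.decompress(blob) + dec.flush()

def decodes_alone(blob):
    """True if the blob can be decoded without any earlier data."""
    try:
        return inflate_raw(blob) == next_block
    except zlib.error:
        return False

primed = deflate_raw(next_block, zdict=prev_block)  # refers back to prev_block
standalone = deflate_raw(next_block)                # no shared history

print(len(primed), len(standalone))
print(decodes_alone(primed), decodes_alone(standalone))
```

The primed block is much smaller, but it is only decodable if the data before it is intact, which is the trade-off being discussed.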

Other gzip variant formats like bgzip, by contrast, make the chunks that are compressed in parallel completely independent. This costs roughly 3% in compression ratio, depending on the use case.
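A rough sketch of the independent-chunk idea, using plain concatenated gzip members rather than the actual bgzip format. The chunk size and the helper names (compress_independent, recover) are assumptions for illustration. Real bgzip records each block's size in the gzip header so a reader can find the boundaries; here the members are simply kept in a list.

```python
import gzip
import zlib

CHUNK = 128 * 1024  # hypothetical chunk size

def compress_independent(data, chunk=CHUNK):
    """Compress each chunk as its own gzip member, with no shared history."""
    return [gzip.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def recover(members):
    """Decode members one at a time; a damaged member becomes None."""
    out = []
    for member in members:
        try:
            out.append(gzip.decompress(member))
        except (OSError, EOFError, zlib.error):
            out.append(None)
    return out

data = b"".join(b"line %d of some log file\n" % i for i in range(40000))
members = compress_independent(data)

# Concatenated members still form a valid multi-member gzip file.
roundtrip = gzip.decompress(b"".join(members))

# Corrupt one byte in the middle of the second member only.
damaged = list(members)
broken = bytearray(damaged[1])
broken[len(broken) // 2] ^= 0xFF
damaged[1] = bytes(broken)

chunks = recover(damaged)
lost = [i for i, c in enumerate(chunks) if c is None]
print("lost chunks:", lost, "of", len(chunks))
```

Only the chunk containing the flipped byte is lost; every other chunk decodes on its own.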

Note that another problem with bit flips and other errors in compression formats is that most decompression tools will simply quit on the first error, even if the rest of the data could still be recovered.

Yes, bz2 is also more robust against errors because of the independent blocks.

One thing to note is that DEFLATE (the underlying algorithm that gzip uses) doesn't indicate the length of blocks, or make it easy to figure out where they start and end without decoding everything preceding them. This is likely why pigz can only parallelize compression, but not decompression.

But even if you could identify the block boundaries up front, DEFLATE doesn't reset the LZ77 window on a new block, so corruption could still seep through to the end.
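For what it's worth, this can be seen with zlib's flush modes. The sketch below is not how any particular tool works; it assumes Z_SYNC_FLUSH as a way to force a new block at a known byte boundary while keeping the window, and Z_FULL_FLUSH as zlib's encoder-side option that also discards the history. The names (split_stream, tail_decodes_alone) are made up for the example.

```python
import random
import zlib

rng = random.Random(1)
first = bytes(rng.getrandbits(8) for _ in range(4096))
second = first  # the second part repeats the first, inviting back-references

def split_stream(flush_mode):
    """Compress two parts as one raw DEFLATE stream, split at the flush point."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    head = comp.compress(first) + comp.flush(flush_mode)
    tail = comp.compress(second) + comp.flush()
    return head, tail

def tail_decodes_alone(tail):
    """True if the tail can be decoded without the bytes before it."""
    try:
        dec = zlib.decompressobj(-15)
        return dec.decompress(tail) + dec.flush() == second
    except zlib.error:
        return False

# New block, window kept: the tail still points back into the first part.
_, sync_tail = split_stream(zlib.Z_SYNC_FLUSH)
# New block, window reset: the tail is self-contained but larger.
_, full_tail = split_stream(zlib.Z_FULL_FLUSH)

print(len(sync_tail), tail_decodes_alone(sync_tail))
print(len(full_tail), tail_decodes_alone(full_tail))
```

So a known block boundary alone is not enough; the encoder also has to give up the history at that boundary, and pay for it in size.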