Hacker News new | ask | show | jobs
by Filligree 637 days ago
One of my favorite gzip party tricks is that (ungzip (cat (gzip a) (gzip b))) == (cat a b). That is to say, the concatenation of two gzip streams is still a valid gzip file.

This hasn't ever been practically useful, but it means you can trivially create a 19-layer gzip file containing more prayer strips than there are atoms in the universe, providing a theological superweapon. All you need to do is write it to a USB-stick, then drop the USB-stick in a river, and you will instantly cause a heavenly crisis of hyperinflation.

10 comments

In bioinformatics we use a modified gzip format called bgzip that exploits this fact heavily. The entire file is made of concatenated gzip chunks. Each chunk then contains the size of the chunk (stored in the gzip header). This lets you do random access inside the compressed blocks more efficiently.

Sadly, the authors hard coded the expected headers so it’s not fully gzip compatible (you can’t add your own arbitrary headers). For example, I wanted to add a chunk hash and optional encryption by adding my own header elements. But as the original tooling all expects a fixed header, it can’t be done in the existing format.

But overall it is easily indexed and makes reading compressed data pretty easy.

So, there you go - a practical use for a gzip party trick!

In such case, imo, it would be more appropriate to use ZIP container format which supports multiple independent entries (files) and index table (including sizes) for them. Compression algorithm is essentially the same (Deflate) so it would not be bloated in any way. As for practical implementations of serializing complex "objects" into ZIP, numpy [0] can be mentioned as an example.

[0] https://numpy.org/doc/stable/reference/generated/numpy.savez...

On top of enabling indexing, it reduces the amount of data lost in the event of data corruption — something you get for free with block-based compression algorithms like BWT-based bzip2 but is most of the time missing from dictionary-based algorithms like LZ-based gzip.

I don't think many people use that last property or are even aware of it, which is a shame. I wrote a tool (bamrescue) to easily recover data from uncorrupted blocks of corrupted BAM files while dropping the corrupted blocks and it works great, but I'd be surprised if such tools were frequently used.

Why do you think I wanted to add hashes and encryption at the block level? :)

I’ve had to do similar things in the past and it’s a great side-feature of the format. It’s a horrible feeling when you find a corrupted FASTQ file that was compressed with normal gzip. At least with bgzip corrupted files, you can find and start recovery from the next block.

Even if it doesn't use block-based compression, if there isn't a huge range of corrupted bytes, corruption offsets are usually identifiable, as you will quickly end up with invalid length-distance pairs and similar errors. Although, errors might be reported a few bytes after the actual corruption.

I was motivated some years ago to try recovering from these errors [1] when I was handling a DEFLATE compressed JSON file, where there seemed to be a single corrupted byte every dozen or so bytes in the stream. It looked like something you could recover from. If you output decompressed bytes as the stream was parsed, you could clearly see a prefix of the original JSON being recovered up to the first corruption.

In that case the decompressed payload was plaintext, but even with a binary format, something like kaitai-struct might give you an invalid offset to work from.

For these localized corruptions, it's possible to just bruteforce one or two bytes along this range, and reliably fix the DEFLATE stream. Not really doable once we are talking about a sequence of four or more corrupted bytes.

[1]: https://github.com/nevesnunes/deflate-frolicking

Even cooler is that it's possible to create an infinite-layer gzip file: https://honno.dev/gzip-quine/
Is that by any chance related to how the TAR format was developed?

Considering the big thing with TAR is that you can also concatenate it together (the format is quite literally just file header + content ad infinitum; it was designed for tape storage - it's also the best concatenation format if you need to send an absolute truckloads of files to a different computer/drive since the tar utility doesn't need to index anything beforehand), making gzip also capable of doing the same logic but with compression seems like a logical followthrough.

I assume tar came first just for grouping things together and then compression came out and they were combined together. That's why the unix tarballs were always tar.gz prior to gz having the ability to do both things.
We use it at $dayjob to concatenate multi-GB gzipped files. We could decompress, cat, and compress again, but why spend the cycles?
> This hasn't ever been practically useful,

I used it a couple times to merge chunks of gzipped CSV together, you know, like "cat 2024-Jan.csv.gz 2024-Feb.csv.gz 2024-Mar.csv.gz > 2024-Q1.csv.gz". Of course, it only works when there is no column headers.

I think the alpine package format do use this in combination with tar being similar https://wiki.alpinelinux.org/wiki/Apk_spec
This is true for a few different compression formats, it works for bzip2 too. I've processed a few TBs of text via `curl | tar -xOf - | bzip2 -dc | grep` for tar files with lots of individually compressed bz2 files inside.
It's kind of nice whenever you find yourself in an environment where, for whatever reason, you need to split a payload before sending it. You just `ungzip cat *` the gzipped files you've collected on the other end.
I've used this trick to build docker images from a collection of layers on the fly without de/recompression
I think ZSTD is also like this.