by jerf 314 days ago
My rule of thumb, which has been surprisingly robust across several uses, is that if you gzip a JSON file you can expect it to shrink by a factor of about 15.

That is not the hallmark of a space-efficient file format.

Between repeated string keys and frequently repeated string values, which are often quite large because they are "human readable", it adds up fast.
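If you want to check the rule of thumb against your own data, a minimal sketch along these lines should do. The records below are synthetic and made up purely for illustration (the field names and values are assumptions, not from any real dataset), so the ratio you see will differ from what real files give you.

```python
import gzip
import json
import random


def gzip_ratio(obj) -> float:
    """Uncompressed JSON size divided by its gzip-compressed size."""
    raw = json.dumps(obj).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


# Synthetic records: every record repeats the same keys, and most values
# come from a small set of "human readable" strings.
rng = random.Random(0)
records = [
    {
        "customer_id": rng.randrange(1_000_000),
        "order_status": rng.choice(["pending", "shipped", "delivered"]),
        "shipping_country": rng.choice(["Germany", "France", "United States"]),
        "is_gift": rng.choice([True, False]),
    }
    for _ in range(10_000)
]

print(f"gzip ratio: {gzip_ratio(records):.1f}")
```

To try it on a real file, load it with `json.load` and pass the result to `gzip_ratio`.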

"I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data."

One trick you can use is to prefix a file with some JSON or other readable value, then dump the binary afterwards. The JSON can hold offsets into the binary as necessary for identifying things, labeling whether or not something is compressed, or whatever. This often largely mitigates the inefficiency concerns, because if you've got a big pile of binary data, the JSON overhead tends to be a small percentage of the payload; if it isn't, then of course I don't recommend this.
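Here is a rough sketch of that trick. The layout is an assumption made for illustration: a 4-byte little-endian length, then the JSON header, then the blobs back to back. The names `pack` and `read_blob` and the `entries`/`offset`/`length`/`compressed` fields are hypothetical, not any standard format.

```python
import io
import json
import struct
import zlib


def pack(blobs, compress=()):
    """Build: u32 LE header length | JSON header | binary payload.

    `blobs` maps names to bytes; `compress` lists names to zlib-compress.
    """
    entries = {}
    payload = bytearray()
    for name, data in blobs.items():
        stored = zlib.compress(data) if name in compress else data
        # Offsets are relative to the start of the binary payload.
        entries[name] = {
            "offset": len(payload),
            "length": len(stored),
            "compressed": name in compress,
        }
        payload += stored
    header = json.dumps({"entries": entries}).encode("utf-8")
    return struct.pack("<I", len(header)) + header + bytes(payload)


def read_blob(f, name):
    """Read one blob from a seekable file without touching the others."""
    f.seek(0)
    (header_len,) = struct.unpack("<I", f.read(4))
    entry = json.loads(f.read(header_len))["entries"][name]
    f.seek(4 + header_len + entry["offset"])
    stored = f.read(entry["length"])
    return zlib.decompress(stored) if entry["compressed"] else stored


archive = pack(
    {"thumb.bin": bytes(range(256)), "notes.txt": b"hello " * 100},
    compress=("notes.txt",),
)
f = io.BytesIO(archive)
print(len(read_blob(f, "thumb.bin")), len(read_blob(f, "notes.txt")))
```

The length prefix is there so a reader knows where the JSON ends without scanning for it; you can still eyeball the header in a hex viewer or with `head`.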

2 comments

> One trick you can use is to prefix a file with some JSON or other readable value, then dump the binary afterwards.

The GLB container (binary glTF) works almost exactly as you described, except there is a fixed size header before the JSON part.

https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html#bi...
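For the curious, the GLB layout is small enough to parse by hand: a 12-byte header (magic, version, total length, all little-endian uint32), then chunks that each carry a length, a type, and data. The sketch below builds a tiny GLB in memory and reads it back; `build_glb` and `read_glb` are hypothetical helper names, and this skips most of the validation a real loader would need.

```python
import json
import struct

GLB_MAGIC = 0x46546C67   # ASCII "glTF"
CHUNK_JSON = 0x4E4F534A  # ASCII "JSON"
CHUNK_BIN = 0x004E4942   # ASCII "BIN\0"


def build_glb(doc, binary):
    """Assemble header + JSON chunk + BIN chunk, each chunk 4-byte aligned."""
    json_bytes = json.dumps(doc).encode("utf-8")
    json_bytes += b" " * (-len(json_bytes) % 4)   # JSON is padded with spaces
    binary = binary + b"\x00" * (-len(binary) % 4)  # BIN is padded with zeros
    body = struct.pack("<II", len(json_bytes), CHUNK_JSON) + json_bytes
    body += struct.pack("<II", len(binary), CHUNK_BIN) + binary
    return struct.pack("<III", GLB_MAGIC, 2, 12 + len(body)) + body


def read_glb(data):
    """Return (version, parsed JSON, binary chunk) from GLB bytes."""
    magic, version, total_length = struct.unpack_from("<III", data, 0)
    if magic != GLB_MAGIC:
        raise ValueError("not a GLB file")
    chunks = {}
    pos = 12
    while pos < total_length:
        chunk_length, chunk_type = struct.unpack_from("<II", data, pos)
        pos += 8
        chunks[chunk_type] = data[pos:pos + chunk_length]
        pos += chunk_length
    return version, json.loads(chunks[CHUNK_JSON]), chunks.get(CHUNK_BIN, b"")


glb = build_glb({"asset": {"version": "2.0"}}, b"\x01\x02\x03")
version, doc, binary = read_glb(glb)
print(version, doc, binary)
```

Note that the binary chunk comes back with its alignment padding; in real glTF the JSON says how many bytes of the buffer are actually in use.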

I can confirm usual compression ratios of 10-20 for JSON. For example, wikidata-20220103.json.gz is quite fun to work with. It is 109 GB, which decompresses to 1.4 TB, and even the non-compressed index for random access with indexed_gzip is 11 GiB. The compressed random access index format, which gztool supports, would be 1.4 GB (compression ratio 8). And rapidgzip even supports the compressed gztool format with further file size reduction by doing a sparsity analysis of required seek point data and setting all unnecessary bytes to 0 to increase compressibility. The resulting index is only 536 MiB.

The trick for the mix of JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes. Although the gztool index format seems to suffice for now.

1) replacing gzip compression with zstd will speed things up by a lot while also reducing disk size

2) Plain old SQLite seems like a good idea for a format: it is widely supported, and fast indexes are included

3) combining (1) and (2) is probably a good idea as well

4) there's also Parquet
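A rough sketch of what (2) and (3) could look like together. Two caveats: this uses zlib from the standard library as a stand-in, since zstd needs a third-party package on most Python versions, and the table layout plus the `put`/`get` helpers are hypothetical names picked for illustration.

```python
import json
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, kind TEXT, body BLOB)")
# The index stays uncompressed, so lookups by kind never decompress anything.
conn.execute("CREATE INDEX docs_kind ON docs(kind)")


def put(conn, doc_id, kind, doc):
    """Store a JSON document as a compressed blob (zlib standing in for zstd)."""
    body = zlib.compress(json.dumps(doc).encode("utf-8"))
    conn.execute("INSERT INTO docs VALUES (?, ?, ?)", (doc_id, kind, body))


def get(conn, doc_id):
    """Fetch and decompress one document, or None if the id is unknown."""
    row = conn.execute("SELECT body FROM docs WHERE id = ?", (doc_id,)).fetchone()
    return None if row is None else json.loads(zlib.decompress(row[0]))


put(conn, 1, "item", {"label": "example", "tags": ["a", "b"]})
put(conn, 2, "property", {"label": "instance of"})
print(get(conn, 1))
print([r[0] for r in conn.execute("SELECT id FROM docs WHERE kind = 'property'")])
```

One tradeoff to keep in mind: compressing each row separately usually compresses worse than compressing the whole file, because keys repeated across rows can no longer be shared.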