Hacker News new | ask | show | jobs
by thriftwy 1675 days ago
Zstandard has very cool dictionary training feature, which allows to keep a separate dictionary and have a 50% ratio compression on very small (~100b) but repetitive data such as database records.
1 comments

I've always thought it could be pretty cool to leverage that for transparent filesystem compression.

For context, filesystem compression usually compresses blocks of data individually (for instance, every 64K block of a file will be individually compressed, and when you modify a file in the middle, that block needs to be recompressed entirely). This is usually good enough, and it has some pretty cool properties, like being able to have only compressable parts of a file compressed, or turning on compression on a file and having only new and rewritten blocks get compressed. Because of Zstd's separated dictionary, it seems like it could be feasible to instead store the dictionary in the file's inode and compress the blocks with that dictionary (recomputing the dictionary and recompressing existing blocks when the file allocates 10 4K blocks and then again at 100 blocks, perhaps).

I wonder what different properties such a compression scheme would have. I imagine it would be able to achieve a much smaller size due to not having to store a dictionary with each compressed block. A downside would be that a corrupted or overwritten inode would render the file completely unrecoverable, where current compression schemes allow blocks to be individually decompressed. Another downside is that files can't be partially compressible, only entirely.

Compression dictionaries don't usually work that way, as far as my understanding goes they just look behind, like "take 15 bytes which were 63 bytes ago", so any data it also a dictionary.

Trained dictionary is just meaningless but very frequent bits of data which may be referenced in that fashion as if they preceded the real data.