
by felixhandte 1483 days ago

Good write-up! I applaud your use of dictionary compression! We've done a lot of work to Zstd to make it a powerful tool that opens doors to new and better compression integrations.

One correction: the "dictionary" in Zstd is sort of a misnomer. It's (mostly[0]) not a data structure. It's simply a concatenation of substrings that occurred frequently in the samples provided to the trainer. When you compress with a dictionary, that content is used as a prefix against which matches can be made (as if you were doing streaming compression).
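To make the "dictionary as prefix" idea concrete, here is a minimal sketch. It deliberately swaps in Python's standard-library `zlib` (DEFLATE, another LZ77-family codec) and its preset-dictionary `zdict` parameter as a stand-in for Zstd, so it runs without extra packages. The `DICTIONARY` contents, the `record`, and the helper names are made up for illustration.

```python
import zlib

# Hypothetical "dictionary": frequent substrings glued together, with no structure.
DICTIONARY = b', "status": "active", "region": "eu-west-1"}' + b'{"user_id": '


def compress(data: bytes, dictionary: bytes = b"") -> bytes:
    # zdict primes the LZ77 window, as if `dictionary` had been streamed
    # immediately before `data`; matches can then point back into it.
    if dictionary:
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, dictionary: bytes = b"") -> bytes:
    # The decompressor needs the exact same prefix to resolve those matches.
    if dictionary:
        decomp = zlib.decompressobj(zdict=dictionary)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


# Illustrative small record: too short to compress well on its own.
record = b'{"user_id": 1842, "status": "active", "region": "eu-west-1"}'

plain = compress(record)
primed = compress(record, DICTIONARY)

print(len(record), len(plain), len(primed))
```

The only thing the "dictionary" contributes here is bytes to match against, which is why any LZ77-style codec that lets you preload its window can benefit from it.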

This is an important observation because it allows other patterns of dictionary compression. Zstd-trained dictionaries can be used with other LZ77 algorithms (like LZ4), and they will be useful there as well. Or, because dictionaries can be totally unstructured content, you can use one piece of content directly as a dictionary for another. This is useful, for example, in use cases like the one you mention, where you store multiple versions of a document. You can compress one version using another as a dictionary, which produces a compressed object that represents the delta.
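The version-as-dictionary pattern can be sketched in a few lines. Again this uses standard-library `zlib` with `zdict` as a stand-in for Zstd (an assumption made so the example is self-contained); the documents `v1` and `v2` and the names `make_delta` / `apply_delta` are hypothetical. Note that DEFLATE's window is at most 32 KiB, so with zlib this only works for small documents.

```python
import zlib


def make_delta(new: bytes, old: bytes) -> bytes:
    # Compress `new` with `old` preloaded as the dictionary; unchanged
    # regions collapse into back-references into `old`.
    comp = zlib.compressobj(level=9, zdict=old)
    return comp.compress(new) + comp.flush()


def apply_delta(delta: bytes, old: bytes) -> bytes:
    # Reconstructing `new` requires the same `old` version as dictionary.
    decomp = zlib.decompressobj(zdict=old)
    return decomp.decompress(delta) + decomp.flush()


# Two illustrative versions of a document that differ in a single line.
v1 = "\n".join(
    f"item {i}: checksum {i * 2654435761 % 2**32:08x}" for i in range(200)
).encode()
v2 = v1.replace(b"item 100:", b"item 100 (edited):")

delta = make_delta(v2, v1)
standalone = zlib.compress(v2, 9)

print(len(v2), len(standalone), len(delta))
```

The `delta` is an ordinary compressed stream; it is only small because nearly everything in `v2` can be expressed as a match against `v1`.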

One other note is that I know of at least one other database that uses dictionary-based Zstd to compress chunks: RocksDB[1].

[0] Zstd-produced dictionaries do actually have a structured header (described here: https://datatracker.ietf.org/doc/html/rfc8878#section-5), but the bulk of the dictionary is unstructured.

[1] https://github.com/facebook/rocksdb/wiki/Dictionary-Compress...