For the "all zeros" case, my concern is that you said you're forcing a reset every 1024 words. This implies that N kilowords of zero data take N times as much space as a single kiloword of zero data.

Good compression algorithms use effectively the same amount of storage for highly redundant data, whether it's 1 kiloword or 1 gigaword (there might be a couple of bytes' difference, since they need to encode a longer variable-size integer). This isn't limited to all zeros or even to a single repeated word, though all zeros can sometimes be a bit smaller.
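To make the size argument concrete, here is a minimal sketch of a toy run-length format. The layout (a 4-byte word followed by a LEB128-style varint count) and the names `encode_varint`, `encode_run`, and `resetting_size` are my own assumptions for illustration, not the format under discussion.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a LEB128-style varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_run(word: int, count: int) -> bytes:
    """Hypothetical run record: 4-byte little-endian word + varint repeat count."""
    return word.to_bytes(4, "little") + encode_varint(count)


def resetting_size(word: int, count: int, block: int = 1024) -> int:
    """Total bytes if the encoder must start a fresh run every `block` words."""
    full, rest = divmod(count, block)
    size = full * len(encode_run(word, block))
    if rest:
        size += len(encode_run(word, rest))
    return size


if __name__ == "__main__":
    print(len(encode_run(0, 1024)))      # one kiloword of zeros, single run
    print(len(encode_run(0, 2**30)))     # one gigaword of zeros, single run
    print(resetting_size(0, 2**30))      # one gigaword with a reset every 1024 words
```

In this toy format a single run costs 6 bytes for a kiloword and 9 bytes for a gigaword, while the resetting variant pays 6 bytes per kiloword, so it grows linearly with N.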

And this does not require giving up on random access, if you care about that: you can separately include an "extent table" (works for large regular repeats; you will have to detect these anyway for other compression strategies, which normally give up on random access), or (for small repeats only) use strides, or ...
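A rough sketch of what I mean by an extent table, under my own assumptions: long runs are recorded once as (start, length, fill), everything else goes into a literal store, and lookups binary-search the extent starts. The class name `ExtentTable`, the tuple layout, and the `min_run` threshold are all hypothetical.

```python
from bisect import bisect_right
from itertools import groupby


class ExtentTable:
    """Hypothetical layout: long runs stored once, everything else as literals."""

    def __init__(self, words, min_run=8):
        self.extents = []   # (start, length, fill, literal_offset)
        self.literals = []  # words not covered by a run extent
        pos = 0
        for value, group in groupby(words):
            n = sum(1 for _ in group)
            if n >= min_run:
                # Run extent: literal_offset is None, fill holds the word.
                self.extents.append((pos, n, value, None))
            else:
                last = self.extents[-1] if self.extents else None
                if last is not None and last[3] is not None:
                    # Grow the preceding literal extent instead of adding one.
                    self.extents[-1] = (last[0], last[1] + n, None, last[3])
                else:
                    self.extents.append((pos, n, None, len(self.literals)))
                self.literals.extend([value] * n)
            pos += n
        self.length = pos
        self.starts = [e[0] for e in self.extents]

    def __getitem__(self, index):
        """Random access in O(log #extents), without expanding any run."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        i = bisect_right(self.starts, index) - 1
        start, _, fill, offset = self.extents[i]
        if offset is None:
            return fill
        return self.literals[offset + index - start]


if __name__ == "__main__":
    words = [5, 6] + [0] * 1000 + [7] + [9] * 20
    table = ExtentTable(words)
    print(table[1], table[500], table[1002], table[1010])
```

The cost of a lookup depends on the number of extents, not on how long any individual run is, which is the property that matters here.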

For reference, BTRFS uses 128KiB chunks for its compression to support mmap and seeking. Of course, the caller should make sure to keep decompressed chunks in cache.
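As an illustration of the general idea (this is not how BTRFS itself is implemented), here is a minimal sketch of independently compressed fixed-size chunks with a small decompressed-chunk cache. The class name `ChunkedBlob`, the use of zlib, and the cache size of 8 are assumptions of mine.

```python
import zlib
from functools import lru_cache

CHUNK = 128 * 1024  # chunk size in bytes, matching the 128KiB figure above


class ChunkedBlob:
    """Hypothetical seekable blob: each chunk is compressed on its own."""

    def __init__(self, data: bytes, chunk_size: int = CHUNK):
        self.chunk_size = chunk_size
        self.length = len(data)
        self.chunks = [
            zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)
        ]
        # Keep a few recently used decompressed chunks in memory.
        self._load = lru_cache(maxsize=8)(self._decompress)

    def _decompress(self, index: int) -> bytes:
        return zlib.decompress(self.chunks[index])

    def read(self, offset: int, size: int) -> bytes:
        """Return up to `size` bytes at `offset`, touching only the chunks needed."""
        end = min(offset + size, self.length)
        out = bytearray()
        while offset < end:
            index, within = divmod(offset, self.chunk_size)
            take = min(end - offset, self.chunk_size - within)
            out += self._load(index)[within:within + take]
            offset += take
        return bytes(out)


if __name__ == "__main__":
    data = bytes(range(256)) * 2048           # 512 KiB of sample data
    blob = ChunkedBlob(data)
    print(blob.read(131070, 4) == data[131070:131074])  # spans two chunks
    print(blob._load.cache_info())
```

The tradeoff is the one described above: redundancy that spans chunk boundaries is not exploited, in exchange for being able to seek without decompressing everything before the target offset.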