| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by SaveTheRbtz 1560 days ago

I was recently playing[0] with the ZSTD seekable format[1]: a *valid* ZSTD stream format (utilizing ZSTD skippable frames, that are ignored by de-compressor) w/ an additional index at the end of the file that allows for random access to the underlying compressed data based on uncompressed offsets. This combined w/ a Content-Defined-Chunking[2] and a cryptographic hash[3] in the index allows for a very efficient content-addressable storage. For example I've successfully applied it to a bazel-cache, which gave me between 10x and 100x wins on repository size w/ negligible CPU usage increase.

[0] https://github.com/SaveTheRbtz/zstd-seekable-format-go

[1] https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...

[2] e.g FastCDC https://www.usenix.org/system/files/conference/atc16/atc16-p...

[3] https://github.com/facebook/zstd/pull/2737

4 comments

wscott 1560 days ago

BitKeeper used a seekable compressed file format for revision control data. It allowed a data-structure to be dynamically loaded on demand without needed to uncompress the whole file. A large empty memory buffer was allocated and then read permission was removed with mprotect(). Then a signal handler populate regions of that buffer with data from the compressed file on demand using the ability to seek to certain boundaries.

This change achieved a 10X speedup on normal operations compared to the old code that used SCCS files.

The compressed format stored data blocks in arbitrary order and then the index at the end of the file gave the data layout. Then allows write to the file without rewriting. BitKeeper itself only needed to append to the end of the file, but the format could support inserting data in the middle by only appending to the physical file.

It also had a data redundancy CRC layer that could detect data corruption and recover data from some types of corruption.

https://www.bitkeeper.org/ https://github.com/bitkeeper-scm/bitkeeper/blob/master/src/l... https://github.com/bitkeeper-scm/bitkeeper/blob/994fb651a404...

link

rwmj 1560 days ago

Really wish this was part of official zstd (https://github.com/facebook/zstd/issues/395#issuecomment-535...) and not a contrib / separate tool.

link

tomberek 1560 days ago

That last proposal is mine! Thanks for the Go implementation, I’ll have to play with it some time. Looks like you’ve built something very close to my intention. Did you come across any additional changes needed to the spec? Any suggestions or results from going through the implementation that would help?

Another thought is if it is possible and how to coordinate te re-using dictionaries.

link

SaveTheRbtz 1555 days ago

Thanks for creating this proposal! Would love to see it committed upstream. For me it worked just fine -- I'll need to play a bit more with it and maybe put it into a real-world production setup to see how well it preforms on real-world data.

link

kldx 1560 days ago

Is your bazel cache implementation open source? I am dabbling in bazel and I am not sure where zstd fits in the bazel cache model. I'm interested in learning more about this

link

SaveTheRbtz 1555 days ago

I did PoC experiments with compression, chunking, and IPFS here: https://github.com/SaveTheRbtz/bazel-cache

If you need a mature compression implementation for bazel I would recommend using recent bazel versions w/ gRPC-based bazel-remote: https://github.com/buchgr/bazel-remote

bazel nowadays supports end-to-end compression w/ `--experimental_remote_cache_compression`: https://github.com/bazelbuild/bazel/pull/14041

link