Hacker News new | ask | show | jobs
by dom0 3290 days ago
If you read the internals description it could just as well be about Borg, very similar principles here, though the application is very different.

By the way, both buzhash and SHA-256 are kinda poor choices for a new system, especially one that targets servers.

2 comments

Yap borg is the first thing I thought of. It already does a lot of this an more: encryption, configurable encoding, a rolling hash computed by the Buzhash algorithm and so on.

Maybe it wasn't geared for CDN delivery during restores but otherwise I've been impressed by borg so far (haven't deployed it in production, only played with it locally though).

https://github.com/borgbackup/borg

This is a description of the internal design:

http://borgbackup.readthedocs.io/en/stable/internals.html

The "latest" version of that page has seen significant additions: http://borgbackup.readthedocs.io/en/latest/internals.html
Thanks. That is a better reference indeed. It's got nice diagrams as well. Can't edit my post any longer, so hopefully others will just see your message.
both buzhash and SHA-256 are kinda poor choices for a new system

Why?

Low software performance
Considering that it uses xz for compression, does the performance of SHA-256 matter? (Well, using faster hash function can speed up finding duplicate blocks, which were already packed.)

I'm more interested to hear about buzhash, though.

I assumeā„¢ that xz won't stay the only choice. I think it's important to understand that in deduplication, you'll pass all data through your hashes one to two times. Regarding buzhash, it can break with byte granularity, and it has a dependency chain that prohibits parallelization. You'll likely never see it go faster than 700-750 MB/s on a desktop CPU (~3.8 GHz Haswell) and it won't profit from non-clock improvements of CPUs. Giving up byte-granularity allows significant improvements in performance, but I don't think anyone comprehensively analysed the impact on deduplication performance. I didn't.

(OTOH if your storage is faster than ~200-300 MB/s (buzhash and a hash, naively combined) then there is likely no issue using higher degrees of I/O concurrency, so you can work around these problems).

Thanks for explanation. Do you know of any implementations?
SHA-256 seems to score well on https://www.cryptopp.com/benchmarks.html .
Depends on the context. If there is a lot of hashing, then a faster alternative like BLAKE2b is better.
what would you suggest instead? numbers?
BLAKE2b mentioned above would be more than twice faster on 64-bit CPUs. But I think we'll soon see SHA-256 CPU instructions on most processors (ARM and lower-cost Intel and latest AMD already ship them -https://neosmart.net/blog/2017/will-amds-ryzen-finally-bring...), so I guess it's not important. For numbers, see blake2.net or bench.cr.yp.to.

For IoT devices, hashes that work on 32-bit words, like SHA-256, actually make more sense and will be faster, so BLAKE2s would work well.

What I'd like to hear from the above commenter is about a faster replacement for buzhash, which I'm also interested in.