by dchest 3278 days ago
Considering that it uses xz for compression, does the performance of SHA-256 matter? (Well, using a faster hash function can speed up finding duplicate blocks that were already packed.)

I'm more interested to hear about buzhash, though.


I assume™ that xz won't stay the only choice. I think it's important to understand that in deduplication, you'll pass all data through your hashes one to two times. Regarding buzhash, it can break with byte granularity, and it has a dependency chain that prohibits parallelization. You'll likely never see it go faster than 700-750 MB/s on a desktop CPU (~3.8 GHz Haswell) and it won't profit from non-clock improvements of CPUs. Giving up byte-granularity allows significant improvements in performance, but I don't think anyone comprehensively analysed the impact on deduplication performance. I didn't.
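For readers who haven't seen buzhash, here is a minimal sketch of the general technique (a cyclic-polynomial rolling hash used for content-defined chunking), not the code of the tool being discussed. The window size, the cut mask, the seeded random table and all function names are assumptions picked for illustration. It shows both properties mentioned above: a cut can happen at any byte offset, and every hash value is computed from the previous one, which is the serial dependency chain.

```python
import random

WINDOW = 48            # rolling window size in bytes (assumed value)
MASK = (1 << 12) - 1   # cut when the low 12 bits are zero (assumed; roughly 4 KiB average)

# Fixed pseudo-random 32-bit value per byte; seeded so results are reproducible.
_rng = random.Random(0)
TABLE = [_rng.getrandbits(32) for _ in range(256)]


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def buzhash(window):
    """Hash a whole window from scratch."""
    h = 0
    for b in window:
        h = rotl32(h, 1) ^ TABLE[b]
    return h


def roll(h, out_byte, in_byte, window=WINDOW):
    """Slide the window by one byte. The new hash needs the previous hash,
    so consecutive positions cannot be computed independently."""
    return rotl32(h, 1) ^ rotl32(TABLE[out_byte], window) ^ TABLE[in_byte]


def chunk_boundaries(data, window=WINDOW, mask=MASK):
    """Return offsets where a chunk may end. Every byte offset is tested,
    which is what byte granularity means here. No min/max chunk size."""
    cuts = []
    if len(data) < window:
        return cuts
    h = buzhash(data[:window])
    for i in range(window, len(data)):
        if h & mask == 0:
            cuts.append(i)
        h = roll(h, data[i - window], data[i], window)
    return cuts
```

Because a cut depends only on the bytes inside the window, inserting a byte at the front of a file shifts the later boundaries by one instead of changing them, which is what makes the resulting chunks deduplicate.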

(OTOH if your storage is faster than ~200-300 MB/s (buzhash and a hash, naively combined) then there is likely no issue using higher degrees of I/O concurrency, so you can work around these problems).
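One way to read "naively combined": if each byte goes through buzhash and then through the cryptographic hash on the same core, the per-byte times add up, so the rates combine harmonically. The sketch below does that arithmetic. The 400 MB/s figure for the second hash is an assumption for illustration, not a number from this thread, and `combined_throughput` is a made-up helper name.

```python
def combined_throughput(*rates_mb_s):
    """Throughput of stages run back to back on the same data:
    times per byte add, so the rates combine harmonically."""
    return 1.0 / sum(1.0 / r for r in rates_mb_s)


buzhash_rate = 700.0   # MB/s, lower end of the figure quoted above
hash_rate = 400.0      # MB/s, assumed value for the second hash
print(round(combined_throughput(buzhash_rate, hash_rate)))  # 255
```

With these inputs the result lands inside the ~200-300 MB/s range quoted above.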

Thanks for the explanation. Do you know of any implementations?