by the_alchemist 3686 days ago
> Riak uses the SHA hash as its distribution mechanism and divides the output range of the SHA hash evenly amongst participating nodes in the cluster.

Wait, Riak uses SHA as its distribution hash? Why use a cryptographic hash for distribution and not something like Murmur3, if you're talking about high performance [0]?

[0] http://blog.reverberate.org/2012/01/state-of-hash-functions-...

4 comments

The hash function in this case is used purely for generating an integer from a small binary blob (bucket/key pair) and that integer is the deterministic artifact that tells Riak Core which machines and hash partitions are supposed to own that data.

The performance impact of that is massively dwarfed (by probably 3+ orders of magnitude) by everything else that's going on in the critical sequential read/write path.

I'd bet it doesn't matter.
Author here. We hash the key, so the number of bytes is negligible. So, I'd agree, I don't think it matters.
Isn't it more about the number of cycles it takes to generate the hash than its size, since you're likely to do that very often in a db context?
Probably. I'd venture to say the reason we use SHA is that we value its uniform distribution over its speed (or lack thereof).
I guess it's worth investigating; Murmur has been in use by some big names (Cassandra, Elasticsearch, Hadoop, etc.) for a while in similar contexts.
I was pretty sure (checked almost 2 years ago) that Murmur3 had a uniform distribution, too. And in terms of speed, it's a relatively easy gain.
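To make the mapping described above concrete, here is a minimal Python sketch, not Riak's actual code. It assumes SHA-1 (a 160-bit output range), a partition count of 64, a simple `bucket + "/" + key` byte encoding, and a hypothetical round-robin assignment of partitions to made-up node names; Riak Core's real encoding and ownership logic may differ.

```python
import hashlib

RING_SIZE = 2 ** 160       # size of the SHA-1 output range (assumed hash)
NUM_PARTITIONS = 64        # assumed partition count, for illustration only
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names


def ring_position(bucket: bytes, key: bytes) -> int:
    """Hash a bucket/key pair to an integer in [0, RING_SIZE)."""
    # The "/" separator is an assumption; it is not Riak's wire encoding.
    digest = hashlib.sha1(bucket + b"/" + key).digest()
    return int.from_bytes(digest, "big")


def partition_for(bucket: bytes, key: bytes) -> int:
    """Map the ring position to one of NUM_PARTITIONS equal-sized ranges."""
    return ring_position(bucket, key) // (RING_SIZE // NUM_PARTITIONS)


def owner_for(bucket: bytes, key: bytes) -> str:
    """Pick an owning node; round-robin assignment is a simplification."""
    return NODES[partition_for(bucket, key) % len(NODES)]


if __name__ == "__main__":
    print(partition_for(b"users", b"alice"), owner_for(b"users", b"alice"))
```

The same bucket/key pair always lands on the same partition, which is the "deterministic artifact" the comment refers to.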
More time is likely spent synchronizing data across the network than hashing. (And the variance in network time is likely high enough to swallow the hash time.)
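One rough way to check the claims above is to time the hash on a small key and compare the result with a network round trip measured in your own environment. The sketch below uses only Python's standard library; the key `users/alice` and the helper name `sha1_seconds_per_hash` are made up for illustration, and the timing includes Python call overhead, so treat it as an upper bound rather than a measurement of Riak itself.

```python
import hashlib
import timeit


def sha1_seconds_per_hash(payload: bytes, iterations: int = 100_000) -> float:
    """Return the average wall-clock seconds for one SHA-1 of payload."""
    total = timeit.timeit(
        lambda: hashlib.sha1(payload).digest(), number=iterations
    )
    return total / iterations


if __name__ == "__main__":
    key = b"users/alice"  # hypothetical small bucket/key blob
    per_hash = sha1_seconds_per_hash(key)
    print(f"SHA-1 of a {len(key)}-byte key: {per_hash * 1e9:.0f} ns per hash")
```

If the per-hash figure is several orders of magnitude below the observed round-trip time, the choice of hash function is unlikely to show up in end-to-end latency.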
Conjecture: not sure if this is the reason why SHA is used, but a useful side effect is that it may make users of Riak less vulnerable to certain types of denial of service attacks. Not 100% sure since I know little about how Riak works, but a more predictable hashing algorithm could make it easy for attackers to overload a given bucket with data, and slow down the db to a crawl.