Hacker News new | ask | show | jobs
by VLM 4625 days ago
All of which is true and well written. However it brings up the obvious point missed in the otherwise well written article that yes, if I use 32 bit hashes to shard into 2 to the power of 32 databases, some will have more collisions than others, aka the 32 MSB of a 32 bit hash has some "bad looking" variations. But I don't care about non-randomness in the LSBs because I'm infinitely more likely to shard into, perhaps, 2 to the power of 4 database machines.

The original article did not descend into that obvious area of research. I see no particular reason why a hash algo that has the worst randomness shoved into the least significant byte (which I simply don't care about) might be an inherent result of smooshing the best randomness into its most significant nibble, which I do care very much about. Given the likely use case for a sharding hash, a smart hash designer would make sure that most of his effort is put into smooth distribution in MSB and perhaps totally ignore the LSB for a given amount of latency / computation / electrical power. After all, the actual users are more likely to hash based on the first 4 bits than the first 24 bits. Although you'll always run into people who think its funny to pull their shard subset out of the hash using the LSB (why?) or some random byte in the middle (why?)

I think MSB for shard key comes out of the tradition in the really olden days of sharding based on raw unhashed data. Sometimes thats random enough such that the MSB of the data makes an excellent shard key and hashing would just slow things down for a minimal gain, even today.

1 comments

Whoa, it's conventional to use the MSB? I usually just do

  bucket = hash(x) & mask
Should I expect non-cryptographic hashes and RNGs to produce better noise in the MSBs? That would explain "Sun Redefines Randomness"
Well, look on the bright side, at least you're not calculating

bucket = hash(x) % buckets

Unless you're stuck supporting non power of two number of shards in which case you need modulus. And shoving more randomness into the LSB is more important than the MSB.

I've thought about it some and I think anyone with a networking background is automatically going to "subnet" their data from MSB down out of pure raw habit. Of course you split top down, just like IP addrs.

I've also seen non-technical popular explanations of sharding which stereotypically use something like phone numbers and start at the most significant digit.

And a stereotypical test question is this does not work well with stuff like customer IDs or transaction IDs because they tend toward sequential-ish, unless you're talking about some kind of data mining thing not daily live transactions.

On the other hand it works well if you shard (sorta) by date if you use the right (usually non-native) format. So if you have dates like 2013-10-14 and shard by YYYYMM somehow, then it could be easy to wipe 201210 from the database because whichever shard its on, its probably not impacting the latency figures for 201310 today. Unless you wanted to speed up the delete by smoothing it across all shards in which case sharding by hash ends up being the smart idea.

Trying to do tricky things can turn into a mess when the tricky thing changes mid project, too.

This is actually an important topic for me. I am implementing a sharding distribution based on consistent hashing using MurmurHash3 as hash function.

I am taking the first 4-bytes of the hash function output and using that. I checked and MurmurHash3 mixes the first and last 8 bytes of hash output as a last step, but I am not sure how much differentiation there is in the first 4-bytes.

I guess it is something I should check.