by VLM
4625 days ago
All of which is true and well written. However, it brings up the obvious point missed in the otherwise well-written article: yes, if I use all 32 bits of a 32-bit hash to shard into 2^32 databases, some will have more collisions than others, i.e. the full 32-bit output has some "bad looking" variations. But I don't care about non-randomness in the LSBs, because I'm infinitely more likely to shard into, perhaps, 2^4 database machines. The original article did not descend into that obvious area of research.

I see no particular reason why a hash algo couldn't end up with its worst randomness shoved into the least significant byte (which I simply don't care about) as an inherent result of smooshing the best randomness into its most significant nibble, which I do care very much about. Given the likely use case for a sharding hash, a smart hash designer would make sure that most of the effort is put into smooth distribution in the MSBs, and perhaps totally ignore the LSBs, for a given amount of latency / computation / electrical power. After all, the actual users are more likely to shard based on the first 4 bits than the first 24 bits. Although you'll always run into people who think it's funny to pull their shard subset out of the hash using the LSBs (why?) or some random byte in the middle (why?).

I think using the MSBs for the shard key comes out of the tradition, in the really olden days, of sharding based on raw unhashed data. Sometimes that's random enough that the MSBs of the data make an excellent shard key, and hashing would just slow things down for a minimal gain, even today.
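To make the "first 4 bits" idea concrete, here is a minimal sketch of picking one of 2^4 shards from the most significant nibble of a 32-bit hash, with the LSB variant alongside for comparison. The use of `zlib.crc32` as the 32-bit hash is purely an assumption for illustration (the article's hash functions are not reproduced here), and the helper names `hash32`, `shard_from_msb`, `shard_from_lsb` and `shard_counts` are hypothetical.

```python
import zlib

HASH_BITS = 32  # width of the hash output we shard on


def hash32(key: bytes) -> int:
    """Stand-in 32-bit hash (CRC-32, chosen only because it ships with Python)."""
    return zlib.crc32(key) & 0xFFFFFFFF


def shard_from_msb(key: bytes, shard_bits: int = 4) -> int:
    """Shard index taken from the top `shard_bits` bits of the hash."""
    return hash32(key) >> (HASH_BITS - shard_bits)


def shard_from_lsb(key: bytes, shard_bits: int = 4) -> int:
    """Shard index taken from the bottom `shard_bits` bits of the hash."""
    return hash32(key) & ((1 << shard_bits) - 1)


def shard_counts(keys, pick, shard_bits: int = 4) -> list:
    """Count how many keys land on each of the 2**shard_bits shards."""
    counts = [0] * (1 << shard_bits)
    for key in keys:
        counts[pick(key, shard_bits)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical key set; real keys would come from your own data.
    keys = [f"user-{i}".encode() for i in range(10_000)]
    print("MSB nibble:", shard_counts(keys, shard_from_msb))
    print("LSB nibble:", shard_counts(keys, shard_from_lsb))
```

Comparing the two count lists for your own keys and hash is a quick way to see whether the bits you actually shard on are the well-distributed ones; the rest of the hash output never enters the decision.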