by mlochbaum 919 days ago
Worth pointing out that this can depend a lot more on fiddly details than you might expect. In particular, you're dealing with a small fixed key width, which allows the hash to be stored in the table instead of the key. The article emphasizes variable-length keys, and I don't see any specialization on key sizes (if 4- and 8-byte keys aren't common then this makes sense; if they are, I'd expect dedicated table code for those sizes to be valuable). Set lookups are also just a bit different from value lookups. I think these cases are different enough that I have no idea whether the results would carry over, although I can see how the bidirectional approach would reduce probing more than RH, which seems good.

...and since I've done a lot of work with Robin Hood on small-key lookups, I can point out some little tweaks that have made a big difference for me. I have 8-byte lookups at just over 3ns/lookup[0], albeit at a very low load factor, typically <50%. A key step was to use the maximum possible hash as a sentinel value for empty slots, handling it specially in case it shows up in the data. This way, instead of probing until finding an empty bucket or greater hash, probing just finds the first slot that's greater than or equal to the requested key's hash. So the lookup code[1] is very simple (the rest, not so much). The while loop is only needed on a hash collision, so at a low load factor a lookup is effectively branchless. However, these choices are specialized for a batched search where the number of insertions never has to be higher than the number of searches and all the insertions can be done first, and they're focused on small-ish tables (under a million entries).
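
To make the sentinel trick concrete, here is a minimal Python sketch of a set of 64-bit keys built along the lines described above. It is not the CBQN code: the class name, the multiplicative hash, and the sizing rule are all assumptions chosen for illustration, and Python cannot show the branchless behaviour or the timings. It assumes the hash is invertible (so storing the hash is as good as storing the key), that slots are kept in hash order without wraparound, and that the number of insertions is bounded up front.

```python
MASK64 = (1 << 64) - 1
EMPTY = MASK64               # maximum possible hash doubles as the "empty" sentinel
MULT = 0x9E3779B97F4A7C15    # odd constant, so the multiply is invertible mod 2**64


def hash64(key: int) -> int:
    """Invertible multiplicative hash of a 64-bit key (illustrative choice)."""
    return (key * MULT) & MASK64


class SentinelRobinHoodSet:
    """Hypothetical sketch: hashes stored in place of keys, kept in sorted order."""

    def __init__(self, bits: int, max_entries: int):
        self.shift = 64 - bits
        # Home slots 0 .. 2**bits - 1, plus overflow room so probing never
        # wraps around and always ends on a sentinel.
        self.slots = [EMPTY] * ((1 << bits) + max_entries)
        self.max_entries = max_entries
        self.count = 0
        self.has_max_hash = False  # tracks the one hash equal to the sentinel

    def _find(self, h: int) -> int:
        # First slot at or after the home slot whose content is >= h.
        # Empty slots hold the maximum value, so they stop the loop too.
        j = h >> self.shift
        while self.slots[j] < h:
            j += 1
        return j

    def add(self, key: int) -> None:
        h = hash64(key)
        if h == EMPTY:             # special case: hash collides with the sentinel
            self.has_max_hash = True
            return
        j = self._find(h)
        if self.slots[j] == h:     # already present
            return
        if self.count >= self.max_entries:
            raise OverflowError("more insertions than the table was sized for")
        # Insert at j and shift the rest of the run right by one slot,
        # stopping once an empty slot has been consumed.
        while h != EMPTY:
            self.slots[j], h = h, self.slots[j]
            j += 1
        self.count += 1

    def __contains__(self, key: int) -> bool:
        h = hash64(key)
        if h == EMPTY:
            return self.has_max_hash
        return self.slots[self._find(h)] == h
```

With a table much larger than the number of entries, most home slots are either empty or hold the requested hash, so the loop in `_find` usually runs zero times; in a compiled language that is what makes the common path a single comparison.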

[0] https://mlochbaum.github.io/bencharray/pages/search.html

[1] https://github.com/dzaima/CBQN/blob/5c7ab3f/src/singeli/src/...


Thanks for the links; the BQN impl looks really interesting. I believe TFA deals with only hash codes and offsets in the hash table proper (keys and values are stored separately in a dynamic array), so fixed-width keys/values still apply. It's true that you can't use keys interchangeably with hash codes for variable-length keys like I do for integer keys, but I don't expect that to affect the relative performance of RH vs. BLP. (I'm curious how they handle colliding hash codes; 32-bit hashes mean you have a ~40% probability of at least one collision at 2^16 keys, which isn't much.)
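
The birthday estimate in the parenthetical can be checked with a few lines of Python. This sketch assumes uniformly distributed hash codes and uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)); the function names are made up for the example. Under that approximation, 2^16 keys with 32-bit hashes come out at about 39%, and the 50% mark falls near 77,000 keys.

```python
import math


def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among n_keys uniform random hashes."""
    space = 2.0 ** hash_bits
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * space))


def keys_for_probability(p: float, hash_bits: int) -> float:
    """Approximate number of keys at which the collision probability reaches p."""
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    print(f"{collision_probability(2**16, 32):.3f}")
    print(f"{keys_for_probability(0.5, 32):.0f}")
```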
Looks like full keys are always compared if hash codes test equal, which is what I'd expect. For example: https://github.com/questdb/questdb/blob/master/core/src/main...
That's correct. In practice, the number of hash collisions is insignificant, so full key comparisons that turn out to be mismatches are extremely rare.

And thanks for sharing your experience with RH and the links!
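
For readers following the subthread, the layout discussed here (only hash codes and offsets in the table proper, keys and values in a separate append-only array, and a full key comparison only when hash codes match) can be sketched as below. This is an illustration under assumptions, not QuestDB's implementation: it uses plain linear probing rather than RH or BLP, FNV-1a as a stand-in 32-bit hash, and hypothetical names throughout.

```python
def fnv1a_32(key: bytes) -> int:
    """32-bit FNV-1a, used here only as a stand-in hash function."""
    h = 0x811C9DC5
    for b in key:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h


class HashOffsetMap:
    """Linear-probing map whose slots hold only (hash code, offset) pairs."""

    def __init__(self, capacity_pow2: int = 16, hash_fn=fnv1a_32):
        self.hash_fn = hash_fn
        self.mask = capacity_pow2 - 1
        self.hashes = [0] * capacity_pow2
        self.offsets = [-1] * capacity_pow2  # -1 marks an empty slot
        self.entries = []                    # (key, value) pairs, stored separately

    def _slot(self, key: bytes, h: int) -> int:
        i = h & self.mask
        while self.offsets[i] != -1:
            # Cheap fixed-width check first; touch the key only on a hash match.
            if self.hashes[i] == h and self.entries[self.offsets[i]][0] == key:
                return i
            i = (i + 1) & self.mask
        return i

    def _grow(self) -> None:
        # Rebuild from the stored hash codes; the keys are never re-read.
        old = [(h, off) for h, off in zip(self.hashes, self.offsets) if off != -1]
        size = 2 * len(self.offsets)
        self.mask = size - 1
        self.hashes = [0] * size
        self.offsets = [-1] * size
        for h, off in old:
            i = h & self.mask
            while self.offsets[i] != -1:
                i = (i + 1) & self.mask
            self.hashes[i], self.offsets[i] = h, off

    def put(self, key: bytes, value) -> None:
        if 2 * len(self.entries) >= len(self.offsets):  # keep load at or below 50%
            self._grow()
        h = self.hash_fn(key)
        i = self._slot(key, h)
        if self.offsets[i] == -1:
            self.hashes[i] = h
            self.offsets[i] = len(self.entries)
            self.entries.append((key, value))
        else:
            self.entries[self.offsets[i]] = (key, value)

    def get(self, key: bytes, default=None):
        i = self._slot(key, self.hash_fn(key))
        return default if self.offsets[i] == -1 else self.entries[self.offsets[i]][1]
```

Because every slot is a fixed-width pair regardless of key length, the probing logic is the same for variable-length and fixed-width keys, which is the point made above about the relative performance of the probing schemes.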