by senderista 919 days ago
Since the blog post mentioned a PR to replace linear probing with Robin Hood, I just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks:

https://github.com/senderista/hashtable-benchmarks/blob/mast...

https://github.com/senderista/hashtable-benchmarks/wiki/64-b...


Worth pointing out that this can depend a lot more on fiddly details than you might expect. In particular, you're dealing with a small fixed width allowing the hash to be stored in the table instead of the key. The article emphasizes variable-length keys, and I don't see any specialization on key sizes (if 4- and 8-byte keys aren't common then this makes sense; if they are then I'd expect dedicated table code for those sizes to be valuable). And set lookups are also just a bit different from value lookups. I think these cases are different enough that I have no idea if the results would carry over, although I can see how the bidirectional approach would reduce probing more than RH which seems good.

...and since I've done a lot of work with Robin Hood on small-key lookups, I can point out some little tweaks that have made a big difference for me. I have 8-byte lookups at just over 3ns/lookup[0], albeit at a very low load factor, typically <50%. A key step was to use the maximum possible hash as a sentinel value for empty slots, handling it specially in case it shows up in the data. This way, instead of probing until finding an empty bucket or greater hash, probing just finds the first slot whose stored hash is greater than or equal to the requested key's hash. So the lookup code[1] is very simple (the rest, not so much). The while loop is only needed on a hash collision, so at a low load factor a lookup is effectively branchless. However, these choices are specialized for a batched search where the number of insertions never has to be higher than the number of searches, and all the insertions can be done first. They are also focused on small-ish (under a million entries) tables.

[0] https://mlochbaum.github.io/bencharray/pages/search.html

[1] https://github.com/dzaima/CBQN/blob/5c7ab3f/src/singeli/src/...
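For readers who want to see the sentinel idea in code, here is a minimal Java sketch. It is not the CBQN implementation: the class name `SortedProbeSet`, the multiplicative `mix` function, the fixed capacity, and the overflow padding at the end of the array are all assumptions made for illustration. Slots hold hashes in sorted order, empty slots hold the maximum unsigned value, and a lookup just scans forward to the first slot that is greater than or equal to the wanted hash.

```java
import java.util.Arrays;

// Hypothetical sketch of an ordered (Robin Hood style) integer set with a max-value sentinel.
final class SortedProbeSet {
    private static final long EMPTY = -1L; // all ones: the maximum unsigned 64-bit value
    private final long[] slots;            // stored hashes, sorted (unsigned) by position
    private final int shift;               // home slot = top bits of the hash
    private final int maxEntries;
    private int size;
    private boolean hasSentinelHash;       // the one hash equal to EMPTY is tracked separately

    // log2Capacity is assumed to be between 1 and 30; the table never grows in this sketch.
    SortedProbeSet(int log2Capacity, int maxEntries) {
        this.shift = 64 - log2Capacity;
        this.maxEntries = maxEntries;
        // Padding of maxEntries slots means a probe never wraps around or runs off the end.
        this.slots = new long[(1 << log2Capacity) + maxEntries];
        Arrays.fill(slots, EMPTY);
    }

    // Assumed mixer: multiplying by an odd constant is a bijection on 64-bit values,
    // so the mixed value can stand in for the key itself.
    static long mix(long key) {
        return key * 0x9E3779B97F4A7C15L;
    }

    // Scan from the home slot to the first slot whose value is >= h (unsigned).
    // Empty slots hold the maximum value, so they always stop the scan.
    private int firstSlotAtOrAbove(long h) {
        int i = (int) (h >>> shift);
        while (Long.compareUnsigned(slots[i], h) < 0) {
            i++;
        }
        return i;
    }

    boolean contains(long key) {
        long h = mix(key);
        if (h == EMPTY) {
            return hasSentinelHash;
        }
        return slots[firstSlotAtOrAbove(h)] == h;
    }

    boolean add(long key) {
        long h = mix(key);
        if (h == EMPTY) {
            boolean added = !hasSentinelHash;
            hasSentinelHash = true;
            return added;
        }
        int i = firstSlotAtOrAbove(h);
        if (slots[i] == h) {
            return false; // already present
        }
        if (size == maxEntries) {
            throw new IllegalStateException("table full");
        }
        // Shift the rest of the run one slot to the right until an empty slot absorbs it.
        long carry = h;
        while (true) {
            long current = slots[i];
            slots[i] = carry;
            if (current == EMPTY) {
                break;
            }
            carry = current;
            i++;
        }
        size++;
        return true;
    }
}
```

Because the home slot comes from the top bits of the hash, sorted order by hash and order by home slot agree, which is what lets the lookup stop at the first value that is not smaller than the target.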

Thanks for the links; the BQN impl looks really interesting. I believe TFA deals with only hash codes and offsets in the hash table proper (keys and values are stored separately in a dynamic array), so fixed-width keys/values still apply. It's true that you can't use keys interchangeably with hash codes for variable-length keys like I do for integer keys, but I don't expect that to affect the relative performance of RH vs. BLP. (I'm curious how they handle colliding hash codes; 32-bit hashes mean you have a ~40% probability of at least one collision at 2^16 keys, which isn't much.)
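A rough sanity check of that collision estimate, using the standard birthday approximation 1 - exp(-n(n-1)/(2d)) for n keys and d possible hash values. The class and method names are made up for this sketch, and it assumes hash codes are uniformly distributed.

```java
// Hypothetical helper: birthday-bound estimate of at least one hash code collision.
final class BirthdayBound {
    // keys: number of distinct keys inserted; hashValues: number of possible hash codes.
    // Uses the approximation 1 - exp(-n(n-1)/(2d)), which assumes uniform hashing.
    static double collisionProbability(double keys, double hashValues) {
        return 1.0 - Math.exp(-keys * (keys - 1.0) / (2.0 * hashValues));
    }
}
```

With 32-bit hash codes, `BirthdayBound.collisionProbability(65536, 4294967296.0)` comes out at roughly 0.39, and the approximation crosses 0.5 at around 77,000 keys, so the point stands: collisions show up at fairly small table sizes.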
Looks like full keys are always compared if hash codes test equal, which is what I'd expect. For example: https://github.com/questdb/questdb/blob/master/core/src/main...
That's correct. In practice, the number of hash collisions is insignificant, so false comparisons are extremely rare.

And thanks for sharing your experience with RH and the links!

A QuestDB engineer here: These are cool benchmarks! The idea to try Robin Hood probing came to me after receiving some feedback on Reddit. I ran initial experiments, and the results were promising, leading to its integration into our codebase. Thank you so much for sharing your repository. Perhaps one day we'll explore bidirectional probing as well!

A snapshot of my happiness after running first experiments with Robin Hood: https://twitter.com/jerrinot/status/1730147245285150743 :)

Hi there!

I made the initial suggestion to look into Robin Hood hashing when it was first posted on Reddit.

Glad to see it make its way into the repo!

indeed! thank you for that :)
> just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks

Research results from the last five years show that Robin Hood hashing performs better than the other approaches under the right conditions. See this eval paper:

https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjo...

Bidirectional isn't benchmarked here, and only mentioned once offhand:

> For example, we could already start searching for elements at the slot with expected (average) displacement from their perfect slot and probe bidirectional from there. In practice, this is not very efficient due to high branch misprediction rates and/or unfriendly access pattern.

I think this indicates a regular Robin Hood insertion and modified search, which doesn't sound that similar to Amble and Knuth's method. And anyway the relative costs of mispredictions and cache misses vary wildly based on workload (the paper studies 8-byte keys only). The paper also doesn't present Robin Hood as a clear winner, which is how I interpreted your comment. It's shown as one of five suggestions in the decision graph at the end, and only recommended for load factors between 50% and 80% among other conditions.

Edit: And the paper is from 2015, not the last five years. Is this the right link?

Thanks! We're still benchmarking Robin Hood hashing and are open to further experiments. The benchmarks look promising.
Can bidirectional linear probing be used for any key type? Or do the keys need to be of some integral type?
The keys-as-hash codes trick only works for fixed-width integers; otherwise you have to explicitly store the hash codes. I assumed unique hash codes in my implementation, but I think you could easily adapt the algorithm to allow duplicate hash codes. Tolerating hash code collisions would avoid having to increase hash code size (for collisions in their case, you'd just need to probe multiple offsets in the key/value array).
Since the hashes are stored, you could order by hash. This would leave keys with the same hash unordered, so if you find an entry with equal hash but unequal key you have to keep probing, but that only matters on a full-hash collision.
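A minimal Java sketch of that idea, under stated assumptions: entries are (hash code, offset) pairs kept in hash order, keys live in a separate list, and on an equal hash the full key is compared before probing continues. The class `HashOrderedIndex`, the `ToIntFunction` hash hook, the fixed capacity, and the padding are all hypothetical; this is not QuestDB's layout or the benchmark repository's code. Storing 32-bit hashes in `long` slots lets `Long.MAX_VALUE` mark empty slots without special-casing any real hash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical sketch: hash-ordered probing that tolerates duplicate hash codes.
final class HashOrderedIndex {
    private static final long EMPTY = Long.MAX_VALUE; // larger than any 32-bit hash
    private final long[] hashes;   // 32-bit hash codes, non-decreasing by position
    private final int[] offsets;   // offset of each entry's key in `keys`
    private final List<String> keys = new ArrayList<>();
    private final ToIntFunction<String> hashFn;
    private final int shift;
    private final int maxEntries;

    // log2Capacity is assumed to be between 1 and 30; the table never grows in this sketch.
    HashOrderedIndex(int log2Capacity, int maxEntries, ToIntFunction<String> hashFn) {
        this.shift = 32 - log2Capacity;
        this.maxEntries = maxEntries;
        this.hashFn = hashFn;
        int length = (1 << log2Capacity) + maxEntries; // padding so probes never wrap
        this.hashes = new long[length];
        this.offsets = new int[length];
        Arrays.fill(hashes, EMPTY);
    }

    private long hashOf(String key) {
        return hashFn.applyAsInt(key) & 0xFFFFFFFFL; // treat the hash as unsigned
    }

    // Returns the key's offset in the key list, or -1 if absent.
    int find(String key) {
        long h = hashOf(key);
        int i = (int) (h >>> shift);
        while (hashes[i] < h) {
            i++; // skip smaller hashes
        }
        // Entries with the same hash are adjacent but unordered: compare full keys.
        while (hashes[i] == h) {
            if (keys.get(offsets[i]).equals(key)) {
                return offsets[i];
            }
            i++;
        }
        return -1;
    }

    // Inserts the key if absent and returns its offset in the key list.
    int put(String key) {
        int existing = find(key);
        if (existing >= 0) {
            return existing;
        }
        if (keys.size() == maxEntries) {
            throw new IllegalStateException("table full");
        }
        long h = hashOf(key);
        int i = (int) (h >>> shift);
        while (hashes[i] <= h) {
            i++; // land just past any entries with the same hash
        }
        int offset = keys.size();
        keys.add(key);
        // Shift the rest of the run right by one slot until an empty slot absorbs it.
        long carryHash = h;
        int carryOffset = offset;
        while (true) {
            long currentHash = hashes[i];
            int currentOffset = offsets[i];
            hashes[i] = carryHash;
            offsets[i] = carryOffset;
            if (currentHash == EMPTY) {
                break;
            }
            carryHash = currentHash;
            carryOffset = currentOffset;
            i++;
        }
        return offset;
    }
}
```

Passing a deliberately weak hash such as `String::length` forces full-hash collisions and exercises the second loop in `find`; with a reasonable hash that loop almost always runs at most once.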