Hacker News new | ask | show | jobs
by tmostak 369 days ago
We've made extensive use of perfect hashing in HeavyDB (formerly MapD/OmniSciDB), and it has definitely been a core part of achieving strong group by and join performance.

You can use perfect hashes not only the usual suspects of contiguous integer and dictionary-encoded string ranges, but also use cases like binned numeric and date ranges (epoch seconds binned per year can use a perfect hash range of one bin per year for a very wide range of timestamps), and can even handle arbitrary expressions if you propagate the ranges correctly.

Obviously you need a good "baseline" hash path to fall back to you, but it's surprising how many real-world use cases you can profitably cover with perfect hashing.

1 comments

So in HeavyDB do you on-the-fly build perfect hashes for queries? I've only ever seen perfect hashes used at 'build time' when the keys are already known and fixed (like keywords in a compiler)
I had the same question! I have never heard of runtime perfect hashing. (Admittedly, I haven’t read the paper yet.)
In the DSA theory literature there is so-called “dynamic perfect hashing” but I don’t think it’s ever been implemented and its use case is served by high-load factor techniques like bucketized cuckoo hashing.
In the appendix of the survey, there are 3 references on dynamic perfect hashing. I think the only actual implementation of a dynamic PHF is a variant of perfect hashing though fingerprinting in the paper "perfect hashing for network applications". However, that implementation is not fully dynamic and needs to be re-built if the key set changes too much.
All of those modern algorithms, even relatively older ones like CHD, can find a perfect hash function over millions of keys in less than a second.[1] Periodically rebuilding a function can be more than fast enough, depending on your use case.

Last time I tried gperf 8-10 years, it took it hours or even days to build a hash function CHD could do in seconds or less. If someone's idea of perfect hash function construction cost is gperf (at least gperf circa 2015)... welcome to the future.

[1] See my implementation of CHD: https://25thandclement.com/~william/projects/phf.html

The article had a reference for being able to compress data in a table (maybe similar in spirit to using a foreign key to a small table). I could also see it being useful in compression dictionaries, but again that's not really a run-time use (and I'm sure I'm not the first to think of it)