Hacker News new | ask | show | jobs
by dpayne 4109 days ago
One caveat of perfect hash functions, including CMPH, is that for a very large amount of keys the data structs they use internally become very large. In the case of CMPH when I tried with around 30 million keys the result was several MBs which caused it to not fit in L3 cache. For large dictionaries, hash functions like xxhash and farmhash are usually a better option than the perfect hash functions.
4 comments

You can combine perfect hashing with string compression (Huffman, prefix tries) and de-duplication techniques to vastly improve memory usage for a wide variety of use cases. DiscoDB does this to good effect (http://discodb.readthedocs.org/en/latest/).
If you use a fast hash function you still have to look up a hash table, so the memory access is still there. If you use some clever probing strategy you might be able to have only 1 cache line access per lookup on average, while most perfect hashing functions, which use the MWHC scheme, do 3 random accesses.

However, these 3 random accesses are independent, so the CPU can mostly pipeline them, and the perceived latency is that of a single random access.

Thanks, I know. I used one of the CMPH algorithms for the first three default methods (from pure perl to pure C), -hanov, -hanovpp, and -urban, and those perform much better than the original CMPH. And I haven't enabled compression.

But I still haven't found the best overall method yet. Hashing all keys shouldn't be necessary. There will come more later.

I don't know much about the topic but I was under the impression a minimal 1 to 1 mapping is usually possible. ref: http://www.burtleburtle.net/bob/hash/perfect.html