| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by dpayne 4109 days ago
	One caveat of perfect hash functions, including CMPH, is that for a very large amount of keys the data structs they use internally become very large. In the case of CMPH when I tried with around 30 million keys the result was several MBs which caused it to not fit in L3 cache. For large dictionaries, hash functions like xxhash and farmhash are usually a better option than the perfect hash functions.

4 comments

t1m 4109 days ago

You can combine perfect hashing with string compression (Huffman, prefix tries) and de-duplication techniques to vastly improve memory usage for a wide variety of use cases. DiscoDB does this to good effect (http://discodb.readthedocs.org/en/latest/).

link

ot 4109 days ago

If you use a fast hash function you still have to look up a hash table, so the memory access is still there. If you use some clever probing strategy you might be able to have only 1 cache line access per lookup on average, while most perfect hashing functions, which use the MWHC scheme, do 3 random accesses.

However, these 3 random accesses are independent, so the CPU can mostly pipeline them, and the perceived latency is that of a single random access.

link

rurban 4109 days ago

Thanks, I know. I used one of the CMPH algorithms for the first three default methods (from pure perl to pure C), -hanov, -hanovpp, and -urban, and those perform much better than the original CMPH. And I haven't enabled compression.

But I still haven't found the best overall method yet. Hashing all keys shouldn't be necessary. There will come more later.

link

hurin 4109 days ago

I don't know much about the topic but I was under the impression a minimal 1 to 1 mapping is usually possible. ref: http://www.burtleburtle.net/bob/hash/perfect.html

link