Hacker News new | ask | show | jobs
by bytehamster 375 days ago
Many modern perfect hash functions are super close to the space lower bound, often having just between 3% and 50% overhead. So your claim that the space consumption "twice as low" is information theoretically impossible. With gperf, the space consumption is in the machine code instead of a data structure, but it's definitely not "for free". In fact, I'm pretty sure that the size of the machine code generated by gperf is very far from optimal. The only exception are tiny input sets with just a couple of hundred keys, where gperf probably wins due to lower constant overheads.
1 comments

My special use case is i.e. unicode property checks, for which gperf is not big enough, and which has millions of keys. Other cases are integer keys.

I'm certainly not willing to load they keys and mpfh properties at query-time from disc, as they are known in advance and can be compiled to C or C++ code in advance, which leads to an instant load-time, in opposition to your costly deserialization times in all your tools.

Your deserialization overhead space is not calculated, and also not the storage costs for the false positive check. It's rather academic, not practical

There is no deserialization time or space overhead. The measurements refer to the deserialized form. They are not loaded from disk.

About false positive checks, I think you misunderstand what a perfect hash function does.

See, everybody can see how you cheat your benchmarks. It's unrealistic to measure only the perfect hash function, when you discard the cost of the deserialization and false-positive checks.

That's always the case when dealing with you academics.

We dont care how you define your costs when we laugh about that. In reality you have to load the keys, the data structures and check for existance. You even discard the existance checks, by omitting the false-positive checks. Checking against a non-existing key will lead to a hit in your check. Only very rarely you know in advance if your key is in the set

Also, you miss the space costs for the ordering.