Hacker News new | ask | show | jobs
by microtonal 4615 days ago
Any read-world use case where you want to check set membership when false positives are allowed.

E.g. one particular Dutch natural language parser uses a left-corner algorithm. To prune the search space, the parser removes hypotheses with unlikely splines, since they will probably not lead to a good parse (the probability of a spline occurring in a good parse is estimated by parsing a large corpus).

Since this parser is written in Prolog and splines are represented as a Prolog terms. So, a natural approach is to assert likely splines as Prolog facts. Although most Prolog implementations do first argument indexing and the splines are ground (contain no variables), it's not an ideal lookup:

- Asserting all the terms requires a lot of memory.

- Lookup requires hashing the term and possibly doing multiple comparisons when there is more than one term with that hash.

This is also an application where a false positive does not matter - it means that you will not prune a particular spline. It can increase parsing time (since it increases the number of hypothesis), but since it purges fewer splines, you are not throwing away parses.

For this reason, we replaced the 'splines as Prolog facts' with a Bloom filter: you hash the Prolog term one or more times and use constant lookup in a bit array. This change made this functionality faster and a lot more compact[2].

As a matter of fact, I implemented this so that at compile time the hashes are computed, the array is constructed and then written as a C file. This is linked in to the Prolog module that does the guided parsing. So, in binary versions of the parser the Bloom filter is actually never computed[3].

[1] There is an understandable description here: http://www.let.rug.nl/vannoord/Presentations/Athens09/Main/p...

[2] Of course, a middle way would be to assert the hashes as facts. But it's still not by far as efficient.

[3] Fun fact: SICStus and SWI-Prolog use different term hashing methods, so the Bloom filter is Prolog-implementation specific ;).