Hacker News new | ask | show | jobs
by muzakthings 3487 days ago
Is this a library that does entity recognition and matching on individual elements in streams, or are the streams themselves being featurized/matched?
1 comments

This library provide probabilistic set membership test on an individual element basis as the stream progresses with a fixed bound on memory. Every call to classifyDistinct() progresses the internal history of a ProbabilisticDeDuplicator.

Traditionally, probabilistic set membership tests were accomplished by using Bloom Filters. However, Bloom Filters only work well on finite sets because as Bloom Filters age (fill up), the false positive rate approaches 1. This issue was addressed by Stable Bloom Filters which frees up space for inserts; however, this introduces false negatives.

This library contributes three implementations of de-duplication algorithms centered around Bloom Filter variants whose design and replacement strategy allows it to reach stability faster than a Stable Bloom Filter while reducing the false FNR by even several orders of magnitude.