|
|
|
|
|
by jparkie
3485 days ago
|
|
This library provide probabilistic set membership test on an individual element basis as the stream progresses with a fixed bound on memory. Every call to classifyDistinct() progresses the internal history of a ProbabilisticDeDuplicator. Traditionally, probabilistic set membership tests were accomplished by using Bloom Filters. However, Bloom Filters only work well on finite sets because as Bloom Filters age (fill up), the false positive rate approaches 1. This issue was addressed by Stable Bloom Filters which frees up space for inserts; however, this introduces false negatives. This library contributes three implementations of de-duplication algorithms centered around Bloom Filter variants whose design and replacement strategy allows it to reach stability faster than a Stable Bloom Filter while reducing the false FNR by even several orders of magnitude. |
|