Hacker News new | ask | show | jobs
by yunwilliamyu 2771 days ago
Should note here that my benchmarks are comparing HyperMinHash to MinHash. I didn't directly compare to HyperLogLog in the paper, though there exist other papers (mostly in the CS theory literature) that do compare MinHash to HyperLogLog in various intersection regimes.
1 comments

In my benchmarks, HyperLogLogs unilaterally outperform MinHash* by a constant factor, while Bloom Filters of the same size become more accurate than hyperloglogs in the 100-500kB range (though this number increases with the cardinalities of sets being compared). The biggest advantage a HyperLogLog has over MinHash and bloom filters is that it works well regardless of the sizes of the sets being compared, while MinHash is sensitive to differences in set size and bloom filters perform terribly in the small sketch domain. (These results aren't yet public, but they will be soon.)

* Part of this might be issues with the original Flajolet et al. estimation algorithm. Ertl provides several improved methods in https://arxiv.org/pdf/1702.01284.pdf.