Hacker News new | ask | show | jobs
by bruce_lipshitz 2770 days ago
> As an aside, there are a bunch of other really cool Jaccard index fingerprinting methods out there that are asymptotically better even than HyperMinHash. But, they aren't as easy to work with and lack some of the other nice properties of MinHash.

Would you be willing to list out these better methods and detail their pros / cons versus MinHash? It would be super helpful for those of us who are students in this field.

1 comments

Things like b-bit MinHash [Li, Konig, 2010]. A good summary of the problem with asymptotics and some lower bounds is Pagh, Stockel, Woodruff, 2014, "Is Min-Wise Hashing Optimal for Summarizing Set Intersection." There's been more recent work, but that's a good place to start.

These other techniques have lower space complexity than MinHash (or even HyperMinHash) but have the con that they lose the streaming and composability properties of MinHash. One of the main advantages of a MinHash (or HLL) sketch is that you can combine sketch(A) with sketch(B) to get sketch(A v B). Alternately, the data structures can be progressively updated as you see a stream of items. The other fingerprinting techniques often require storing auxiliary data during the generation process.