Hacker News new | ask | show | jobs
by yunwilliamyu 2772 days ago
Things like b-bit MinHash [Li, Konig, 2010]. A good summary of the problem with asymptotics and some lower bounds is Pagh, Stockel, Woodruff, 2014, "Is Min-Wise Hashing Optimal for Summarizing Set Intersection." There's been more recent work, but that's a good place to start.

These other techniques have lower space complexity than MinHash (or even HyperMinHash) but have the con that they lose the streaming and composability properties of MinHash. One of the main advantages of a MinHash (or HLL) sketch is that you can combine sketch(A) with sketch(B) to get sketch(A v B). Alternately, the data structures can be progressively updated as you see a stream of items. The other fingerprinting techniques often require storing auxiliary data during the generation process.