Hacker News new | ask | show | jobs
by yunwilliamyu 2771 days ago
Hi there! I'm the lead author on the paper that Seif (expertly) implements. You're right that my Python implementation isn't at all optimized, as I was more interested in the theory, though I think Seif's Go implementation does much better.

HyperLogLogs are normally used for intersections through inclusion-exclusion (though there are more sophisticated methods out there). As such, it is really good when the two sets A and B being intersected are about the same size and the intersection is large. It performs poorly when |A ^ B| << |A v B| because inclusion-exclusion means that a small error in the union size results in a large relative error in the intersection size.

HyperLogLog uses 6 bits per bucket in normal use. For our purposes, the rule of thumb we propose uses 16 bits per bucket. So we lose a factor of ~3 on the number of buckets. In exchange, we don't lose as much accuracy when |A ^ B| << |A v B| and also do much better for multi-set intersections (where inclusion-exclusion becomes unwieldy).

As an aside, there are a bunch of other really cool Jaccard index fingerprinting methods out there that are asymptotically better even than HyperMinHash. But, they aren't as easy to work with and lack some of the other nice properties of MinHash.

2 comments

> As an aside, there are a bunch of other really cool Jaccard index fingerprinting methods out there that are asymptotically better even than HyperMinHash. But, they aren't as easy to work with and lack some of the other nice properties of MinHash.

Would you be willing to list out these better methods and detail their pros / cons versus MinHash? It would be super helpful for those of us who are students in this field.

Things like b-bit MinHash [Li, Konig, 2010]. A good summary of the problem with asymptotics and some lower bounds is Pagh, Stockel, Woodruff, 2014, "Is Min-Wise Hashing Optimal for Summarizing Set Intersection." There's been more recent work, but that's a good place to start.

These other techniques have lower space complexity than MinHash (or even HyperMinHash) but have the con that they lose the streaming and composability properties of MinHash. One of the main advantages of a MinHash (or HLL) sketch is that you can combine sketch(A) with sketch(B) to get sketch(A v B). Alternately, the data structures can be progressively updated as you see a stream of items. The other fingerprinting techniques often require storing auxiliary data during the generation process.

Thank you! I’ll probably further explore hmh. I have a draft implementation which I’ll compare to related structures soon. A key advantage that hyperminhash has is the ability to sample, if one is using a large enough sampled point and a reversible hash, which could be helpful for certain applications. (Since hll discards all input items)

In fact, I spend 8 bits per entry to facilitate SIMD acceleration, but I’m happy with the speed/memory tradeoff.

A lot of issues are eliminated by using Ertl’s MLE or JMLE formulas, but I do see how sets of combinatorial comparisons could be helped with hmh.