|
|
|
|
|
by yunwilliamyu
2771 days ago
|
|
Hi there! I'm the lead author on the paper that Seif (expertly) implements. You're right that my Python implementation isn't at all optimized, as I was more interested in the theory, though I think Seif's Go implementation does much better. HyperLogLogs are normally used for intersections through inclusion-exclusion (though there are more sophisticated methods out there). As such, it is really good when the two sets A and B being intersected are about the same size and the intersection is large. It performs poorly when |A ^ B| << |A v B| because inclusion-exclusion means that a small error in the union size results in a large relative error in the intersection size. HyperLogLog uses 6 bits per bucket in normal use. For our purposes, the rule of thumb we propose uses 16 bits per bucket. So we lose a factor of ~3 on the number of buckets. In exchange, we don't lose as much accuracy when |A ^ B| << |A v B| and also do much better for multi-set intersections (where inclusion-exclusion becomes unwieldy). As an aside, there are a bunch of other really cool Jaccard index fingerprinting methods out there that are asymptotically better even than HyperMinHash. But, they aren't as easy to work with and lack some of the other nice properties of MinHash. |
|
Would you be willing to list out these better methods and detail their pros / cons versus MinHash? It would be super helpful for those of us who are students in this field.