Hacker News new | ask | show | jobs
by yunwilliamyu 3088 days ago
Hi there! I'm the author of the paper the OP implemented, so I thought I'd give a bit more context.

In practice, combining together HLL and MinHash as they do for AdRoll gives the same kinds of estimation errors. However, using a raw MinHash data structure requires O(log n) space. If you are not space constrained, I second nemothekid and suggest just using off-the-shelf HLL intersection + MinHash libraries, such as the ones he links to.

The HyperMinHash algorithm, which the OP implemented in Golang, combines together the MinHash and HLL data structures so that we only need O(log log n) space, where n is the size of the union. In practice, the difference is between buckets of size log n + loglog n bits for HLL-MinHash and loglog n + 10 bits for HyperMinHash.

Given the constants involved, for n = 2^32, this is only a space savings of about 60% (37 bits vs 15 bits). However, this increases as n grows, so for n=2^64, it's 77% (70 bits vs 16 bits).