by samscully 1183 days ago
The book Mining of Massive Datasets [1] has useful information on building an efficient similarity index using Jaccard similarity and MinHash. I would also recommend Otmar Ertl's papers on extensions of MinHash that estimate Jaccard similarity more accurately in certain situations, e.g. SuperMinHash [2].

[1] http://www.mmds.org/ (Chapter 3)

[2] https://arxiv.org/abs/1706.05698
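
For anyone who wants to see the basic idea in code, here is a minimal sketch of plain MinHash plus the banding trick for building an index, roughly along the lines of what Chapter 3 of [1] describes. This is not SuperMinHash and not taken from either reference; the function names (`minhash_signature`, `estimate_jaccard`, `lsh_band_keys`) and the choice of seeded BLAKE2b as the hash family are my own assumptions for illustration.

```python
import hashlib


def _seeded_hash(item: str, seed: int) -> int:
    """64-bit hash of `item` under hash function number `seed` (assumed hash family)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Signature = minimum hash value of the set under each of `num_hashes` functions."""
    items = set(items)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B|, for comparison."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def lsh_band_keys(signature, bands):
    """Split a signature into `bands` equal slices; each (band index, slice) is a bucket key.

    Two sets become candidate pairs if they share at least one bucket key.
    """
    if len(signature) % bands != 0:
        raise ValueError("signature length must be divisible by the number of bands")
    rows = len(signature) // bands
    return [(i, tuple(signature[i * rows:(i + 1) * rows])) for i in range(bands)]


if __name__ == "__main__":
    a = {str(i) for i in range(0, 100)}
    b = {str(i) for i in range(50, 150)}
    sig_a = minhash_signature(a, num_hashes=256)
    sig_b = minhash_signature(b, num_hashes=256)
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

To turn this into an index you would store each document ID under all of its band keys (e.g. in a dict of sets) and only compare signatures for documents that collide in at least one bucket. More bands with fewer rows per band catches lower-similarity pairs at the cost of more candidates.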