	by samscully 1229 days ago
	The book Mining of Massive Datasets [1] has useful information on building an efficient similarity index using Jaccard/MinHash. I would also recommend Otmar Ertl's papers on extensions of MinHash that approximate Jaccard better in certain situations, e.g. SuperMinHash [2].

	[1] http://www.mmds.org/ Chapter 3

	[2] https://arxiv.org/abs/1706.05698
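
	For anyone who hasn't seen the basic idea, here is a rough sketch of plain MinHash in Python. It is not taken from the book or from Ertl's papers, and it is not SuperMinHash. The function names, the seeded BLAKE2b hashing, and the default of 128 hash functions are all my own assumptions for illustration. The point it shows: the fraction of signature positions where two sets agree is an estimate of their Jaccard similarity.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(item, seed):
    """64-bit hash of an item under a given seed (illustrative choice:
    BLAKE2b over 'seed:item'; any well-mixed hash family would do)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the set. Assumes `items` is non-empty."""
    return [min(_seeded_hash(x, seed) for x in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = {str(i) for i in range(100)}
    b = {str(i) for i in range(50, 150)}
    sig_a = minhash_signature(a, num_hashes=256)
    sig_b = minhash_signature(b, num_hashes=256)
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

	This recomputes every hash for every element, so it costs O(set size x number of hashes) per signature, which is fine for a demo but slow at scale. An actual similarity index would also need something like the banding/LSH step on top of the signatures so you don't compare every pair.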