by samscully
1183 days ago
The book Mining of Massive Datasets [1] has useful information on building an efficient similarity index using Jaccard/minhash. I would also recommend Otmar Ertl's papers on extensions of minhash that approximate Jaccard better in certain situations, e.g. superminhash [2].

[1] http://www.mmds.org/ Chapter 3
[2] https://arxiv.org/abs/1706.05698
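
For anyone who wants to see the basic idea before reading the references, here is a minimal sketch of plain minhash in Python. It is not the construction from either reference and not superminhash. The function names, the use of zlib.crc32 to hash tokens, the (a*x + b) mod p hash family and the signature length of 256 are all assumptions made for illustration. The fraction of matching signature positions is used as an estimate of the Jaccard similarity of the two sets.

```python
import random
import zlib

# Assumption: a Mersenne prime larger than any 32-bit token hash.
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(tokens, params):
    """Return one minimum per hash function over a non-empty token set."""
    # crc32 gives a hash that is stable across runs, unlike built-in hash().
    hashed = [zlib.crc32(t.encode("utf-8")) for t in set(tokens)]
    return [min((a * x + b) % PRIME for x in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    params = make_hash_params(256)
    doc_a = {f"w{i}" for i in range(100)}
    doc_b = {f"w{i}" for i in range(50, 150)}
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("estimate:", estimate_jaccard(sig_a, sig_b))
    print("exact:   ", exact_jaccard(doc_a, doc_b))
```

Longer signatures give a less noisy estimate at the cost of more hashing. An efficient index would then group similar signatures, for example with the locality-sensitive hashing (banding) approach covered in [1] Chapter 3.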