by ashvardanian
711 days ago
Combining hashing or tiny neural nets with a vector search engine that supports Tanimoto/Jaccard distance is a very common deduplication strategy for large datasets. It might be wiser than using linear-complexity MapReduce operations. There is a nice Google project that uses the 0.5M-parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
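
To make the hashing variant concrete, here is a rough stdlib-only sketch. It is not the unisim, RETSim, or USearch API: the function names, the character-trigram hashing scheme, the 1024-bit fingerprint width, and the 0.8 threshold are all assumptions picked for illustration, and the brute-force loop stands in for what a real vector search engine would do with an approximate index.

```python
import hashlib


def shingles(text, n=3):
    """Character n-grams of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def fingerprint(text, bits=1024, n=3):
    """Hash every n-gram to one position of a `bits`-wide bitset (stored as an int)."""
    fp = 0
    for s in shingles(text, n):
        digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
        fp |= 1 << (int.from_bytes(digest, "big") % bits)
    return fp


def tanimoto(a, b):
    """Tanimoto/Jaccard similarity of two bitsets: |a AND b| / |a OR b|."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


def deduplicate(texts, threshold=0.8):
    """Return indices of texts kept after dropping near-duplicates.

    Brute force: each text is compared against everything kept so far.
    A vector search engine would replace this inner scan with an index lookup.
    """
    kept = []  # (index, fingerprint) pairs
    for i, text in enumerate(texts):
        fp = fingerprint(text)
        if all(tanimoto(fp, kept_fp) < threshold for _, kept_fp in kept):
            kept.append((i, fp))
    return [i for i, _ in kept]


docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog!",
    "Completely unrelated sentence about vector search engines",
]
print(deduplicate(docs))
```

The second string differs from the first by a single trigram, so their fingerprints overlap almost entirely and it gets dropped. With a neural embedding instead of hashed trigrams the pipeline has the same shape, only the fingerprint step and the metric change.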