by natch
2417 days ago
What are some of the ways you can improve on the n-squared nature of these distance calculations at scale? In your use case you have a limited set of hashes you are comparing against, but I'm thinking of the more difficult use case of comparing every image against every other image (the context is ML training and weeding out overly similar examples so that overfitting doesn't happen, not related to CSAM). This quickly becomes untenable if you get, say, millions or billions of images. "Don't do that" is one answer but are there any other ways you have come across to mitigate that?

On the same page, we mention two tools, FAISS [3] and Annoy [4], that support approximate search at scales where computing the full distance matrix is impractical.
[1] https://perception.thorn.engineering/en/latest/examples/dedu...
[2] https://authors.library.caltech.edu/7694/
[3] https://github.com/facebookresearch/faiss
[4] https://github.com/spotify/annoy
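
To make the idea of avoiding the full distance matrix concrete, here is a minimal standard-library sketch. It is not how FAISS or Annoy work internally, and it is not taken from the linked page; it swaps in a simpler technique, banded bucketing, and assumes each image has already been reduced to a 64-bit integer hash compared by Hamming distance. The function name `near_duplicate_pairs` and its parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def near_duplicate_pairs(hashes, max_distance=3, bits=64):
    """Return index pairs (i, j), i < j, with Hamming distance <= max_distance.

    Assumption: `hashes` is a list of non-negative ints that fit in `bits` bits.
    Each hash is split into (max_distance + 1) bands. By the pigeonhole
    principle, two hashes that differ in at most max_distance bits must agree
    exactly on at least one band, so only hashes sharing a band are compared.
    """
    bands = max_distance + 1
    # Band boundaries, spread as evenly as possible over the bit width.
    edges = [bits * i // bands for i in range(bands + 1)]
    candidates = set()
    for lo, hi in zip(edges, edges[1:]):
        mask = (1 << (hi - lo)) - 1
        buckets = defaultdict(list)
        for idx, h in enumerate(hashes):
            buckets[(h >> lo) & mask].append(idx)
        # Only hashes that collide on this band become candidate pairs.
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    # Verify every candidate with an exact Hamming distance check.
    return sorted(
        (i, j)
        for i, j in candidates
        if bin(hashes[i] ^ hashes[j]).count("1") <= max_distance
    )
```

Calling `near_duplicate_pairs(hashes, max_distance=3)` on a list of integer hashes returns the matching index pairs without comparing every hash against every other one. The saving depends on the buckets staying small: if many hashes share a band value, the work inside that bucket is still quadratic, which is one reason dedicated libraries are the better choice at millions or billions of items.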