Hacker News new | ask | show | jobs
by natch 2417 days ago
What are some of the ways you can improve on the n-squared nature of these distance calculations at scale?

In your use case you have a limited set of hashes you are comparing against, but I'm thinking of the more difficult use case of comparing every image against every other image (the context is ML training and weeding out overly similar examples so that overfitting doesn't happen, not related to CSAM). This quickly becomes untenable if you get, say, millions or billions of images.

"Don't do that" is one answer but are there any other ways you have come across to mitigate that?

1 comments

Hi there! Our example on deduplication [1] takes the case of deduplicating of the Caltech256 [2] dataset for the purpose I think you're describing, which I interpreted as avoiding duplicates in a dataset so that you don't have images that end up in both your training and test sets. Even though the dataset contains >30K images, you can still do this in memory (and find a handful of duplicates!) because each category is relatively small and you probably don't need to deduplicate images of trains with images of dogs.

On the same page, we mention two tools (specifically, FAISS [3] and Annoy [4]) to help with doing approximate search for scales where computing the distance matrix is impractical.

[1] https://perception.thorn.engineering/en/latest/examples/dedu...

[2] https://authors.library.caltech.edu/7694/

[3] https://github.com/facebookresearch/faiss

[4] https://github.com/spotify/annoy

One more thing: I tried the different hashes that would run on my system (some had a dependency on opencv-contrib that was not met, and which python -m pip couldn't find) and found they aren't what I'm looking for. They may be good at finding different versions of the same image, but what I want is more like environment similarity. So that two pictures taken on the same street would be closer than a picture of the street versus a picture of a forest. They don't seem to be tuned for this task at all. So, my search continues. Let me know if you have suggestions on this as well.
Perceptual hashes in this family will probably not be the right fit for that use case. There is work out there to build hashes that are intended to find find semantically similar content, typically using CNNs. Imagededup [1] seems to have one of these though I have never used it myself.

[1] https://idealo.github.io/imagededup/

Yes looked at annoy previously and it is really impressive. Will probably revisit that as part of the solution here. You did interpret my use case exactly right. Thanks for the reply and especially for the links!