Hi there! Our example on deduplication [1] uses the Caltech256 [2] dataset for the purpose I think you're describing, which I interpreted as removing duplicates so that the same image doesn't end up in both your training and test sets. Even though the dataset contains >30K images, you can still do this in memory (and find a handful of duplicates!) because each category is relatively small and you probably don't need to compare images of trains against images of dogs.
On the same page, we mention two tools (specifically, FAISS [3] and Annoy [4]) for approximate nearest-neighbor search at scales where computing the full distance matrix is impractical.
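To make the per-category idea concrete, here is a minimal sketch in plain Python. It is not the code from our example page: the function names, the 2-bit threshold, and the 16-bit integer hashes are all made up for illustration, and real hashes would come from whichever hasher you pick. The point is only that pairs are compared inside a category, never across categories.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def find_duplicates(items, threshold=2):
    """Return (category, id_a, id_b) for near-identical hashes in one category.

    `items` is an iterable of (category, image_id, hash_as_int).
    `threshold` is a hypothetical cutoff in bits; tune it for your hasher.
    """
    by_category = defaultdict(list)
    for category, image_id, h in items:
        by_category[category].append((image_id, h))

    pairs = []
    for category, members in by_category.items():
        # All pairs within a single category: quadratic, but each group is small.
        for (id_a, h_a), (id_b, h_b) in combinations(members, 2):
            if hamming(h_a, h_b) <= threshold:
                pairs.append((category, id_a, id_b))
    return pairs


# Made-up data: file names and hash values are illustrative only.
items = [
    ("dog", "dog_001.jpg", 0b1111000011110000),
    ("dog", "dog_002.jpg", 0b1111000011110001),      # 1 bit away from dog_001
    ("dog", "dog_003.jpg", 0b0000111100001111),
    ("train", "train_001.jpg", 0b1111000011110000),  # same hash, other category
]
print(find_duplicates(items))
```

Note that train_001.jpg is never compared with dog_001.jpg even though their hashes match, which is what keeps the work small. Once a single group gets too large for all-pairs comparison, that is where FAISS or Annoy would come in.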
One more thing: I tried the different hashes that would run on my system (some had a dependency on opencv-contrib that was not met and that python -m pip couldn't find) and found they aren't what I'm looking for. They may be good at finding different versions of the same image, but what I want is more like environment similarity, so that two pictures taken on the same street would be closer to each other than a picture of the street is to a picture of a forest. They don't seem to be tuned for this task at all. So, my search continues. Let me know if you have suggestions on this as well.
Perceptual hashes in this family will probably not be the right fit for that use case. There is work out there on hashes that are intended to find semantically similar content, typically using CNNs. Imagededup [1] seems to have one of these, though I have never used it myself.
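As a rough illustration of what "semantically similar" means in practice, here is a sketch that ranks images by cosine similarity between embedding vectors. It assumes you already have one vector per image from some CNN; the 3-dimensional vectors and the names below are invented for the example and do not come from Imagededup or any real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to `query`.

    `candidates` maps a name to its embedding vector.
    """
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )


# Made-up embeddings: real ones would have hundreds of dimensions
# and would be produced by a CNN, not typed in by hand.
street_a = [0.9, 0.1, 0.2]
candidates = {
    "street_b": [0.8, 0.2, 0.1],
    "forest": [0.1, 0.9, 0.3],
}
print(rank_by_similarity(street_a, candidates))
```

With embeddings like these, the second street photo ranks above the forest photo, which is the "environment similarity" behavior described above. Whether a given model's embeddings actually capture that is something you would have to check on your own images.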
Yes, I looked at Annoy previously and it is really impressive. I will probably revisit it as part of the solution here. You did interpret my use case exactly right. Thanks for the reply and especially for the links!