You are probably very aware of it, but just to highlight the importance of this for people who aren't aware: data duplication degrades the training and makes memorization (and therefore plagiarism, in the technical sense) more likely. For language models, this includes near-similarities, which I'd guess would extend to images.