This is the tactic used by the CNN / Daily Mail dataset[1] released by Google DeepMind. In that situation you want everyone to have the same, original files, but the contents of a URL may be updated or disappear. So the original URLs are recorded, but the pages themselves are retrieved from the Wayback Machine.
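To make the idea concrete, here is a minimal sketch of recording an original URL, fetching it through an archive snapshot, and checking the result. The function names, the fixed timestamp and the SHA-256 check are my own assumptions for illustration, not how the rc-data scripts actually do it; the only real detail is the general `web.archive.org/web/<timestamp>/<url>` snapshot URL shape.

```python
import hashlib

# Snapshot URL prefix of the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, timestamp):
    """Build a snapshot URL for an original URL (hypothetical helper).

    timestamp is a YYYYMMDDhhmmss string; the archive serves the
    capture closest to it.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def matches_expected(content, expected_sha256):
    """Check that downloaded bytes match a digest recorded with the dataset."""
    return hashlib.sha256(content).hexdigest() == expected_sha256


# Illustrative values only; no network access is needed for this part.
url = wayback_url("http://example.com/story.html", "20150401000000")
```

The dataset then only has to ship the list of original URLs (plus, in this sketch, a digest per page), and everyone downloads the same frozen snapshot rather than whatever the live site serves today.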
Even so, a few of the pages are essentially lost, or at least I was never able to retrieve them. Kyunghyun Cho hosts a copy of the processed data on his site at NYU, which may be less likely to receive legal requests than Google or a similar commercial company hosting it.
Distribution of datasets is going to remain a pain point for machine learning. No one wants to be the legal guinea pig for exactly what constitutes fair use in ML.
Strict hashes (like MD5 / SHA-1 / SHA-2) don't make much sense here, as the images from the 'less obvious sources' might be thumbnails of different sizes, or perhaps the same size but re-encoded (not always a good idea, but it happens in practice).
Would a fuzzy hash, like the output from a known and fixed classifier, work for that?
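For a rough feel of the difference, here is a small sketch comparing a strict hash with a simple perceptual hash. Note that it swaps in an "average hash" rather than a classifier output, and it assumes the image has already been reduced to a tiny grayscale grid; the pixel values and function names are made up for illustration.

```python
import hashlib


def average_hash(pixels):
    """Return a bit string: '1' where a pixel is brighter than the grid mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)


def hamming(a, b):
    """Count the positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def strict_hash(pixels):
    """SHA-256 over the raw pixel bytes; any change alters the digest."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()


# Hypothetical 4x4 grayscale thumbnail (values 0-255).
original = [
    [10, 20, 200, 210],
    [12, 22, 198, 205],
    [11, 19, 202, 208],
    [13, 21, 199, 207],
]
# Stand-in for a re-encoded copy: every pixel is slightly brighter.
reencoded = [[v + 2 for v in row] for row in original]

same_strict = strict_hash(original) == strict_hash(reencoded)
distance = hamming(average_hash(original), average_hash(reencoded))
```

Here the strict digests differ even though the picture is visually the same, while the average hashes are identical. In practice you would accept a match when the Hamming distance falls under some threshold, which is a tuning choice rather than a guarantee.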
Not necessarily. They're listed as CC-BY on Flickr, but that doesn't necessarily make it true. Specifically, the user may not have the authority to grant such a license (whereas I assume that in any case where the user "accidentally" set it to CC-BY, they can't retract it?)
[1]: https://github.com/deepmind/rc-data