
by Smerity 3552 days ago

This is the tactic used by the CNN / Daily Mail dataset[1] released by Google DeepMind. In that situation you want everyone to have the same, original file, since the contents of a URL may be updated or disappear. The original URLs are recorded, but the pages themselves are retrieved from the Wayback Machine.
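As a rough illustration of pairing a recorded URL with an archived snapshot, here is a minimal sketch that builds a Wayback Machine address from an original URL and a capture timestamp. The function name `wayback_url` and the example values are hypothetical and are not taken from the dataset's own scripts; it only relies on the public `web.archive.org/web/<timestamp>/<url>` address pattern.

```python
def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine snapshot URL.

    timestamp uses the YYYYMMDDhhmmss form; the archive serves the
    capture closest to that time for the given original URL.
    """
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    print(wayback_url("http://example.com/story.html", "20150401000000"))
```

Recording the timestamp next to the original URL is what lets everyone fetch the same capture rather than whatever the live page currently shows.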

Even so, a few of the pages are still essentially lost, or at least I was never able to retrieve them. Kyunghyun Cho hosts a copy of the processed data on his site at NYU, which may be less likely to receive legal requests than Google or a similar commercial company hosting the data.

Distribution of datasets is going to be a continuing pain point for machine learning. No one wants to be the legal guinea pig for exactly what constitutes fair use in ML.

[1]: https://github.com/deepmind/rc-data


DarkLinkXXXX 3552 days ago

Agreed. Do you think maybe these datasets should include checksums so that they might be recovered with more certainty, from less obvious sources?

P.S.: Also, archive.org retroactively removes (or delists?) site content when blocked by robots.txt, as domain parkers often do, so I could see that causing trouble down the line. https://archive.org/post/1019415/retroactive-robotstxt-remov...


krasin 3544 days ago

I have added OriginalSize and OriginalMD5 columns to images.csv in the dataset: https://github.com/openimages/dataset/commit/526555b43ab74b6...

Still thinking about a soft hash.
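For anyone re-fetching an image from another source, a check against the OriginalSize and OriginalMD5 columns might look roughly like the sketch below. The function name `matches_original` is hypothetical, and the code assumes the stored digest is a hex string; if the column uses a different encoding (base64, for example) it would need to be decoded first.

```python
import hashlib


def matches_original(data: bytes, original_size: int, original_md5_hex: str) -> bool:
    """Return True if data has the recorded byte size and MD5 digest.

    Assumption: the recorded digest is hex-encoded.
    """
    # Cheap check first: a size mismatch rules the file out without hashing.
    if len(data) != original_size:
        return False
    return hashlib.md5(data).hexdigest() == original_md5_hex.strip().lower()


if __name__ == "__main__":
    payload = b"example image bytes"  # stand-in for a downloaded file
    recorded_md5 = hashlib.md5(payload).hexdigest()
    print(matches_original(payload, len(payload), recorded_md5))
```

This only confirms a byte-for-byte identical file; a resized or re-encoded copy will fail the check even if it shows the same picture.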


krasin 3552 days ago

Strict hashes (like MD5 / SHA-1 / SHA-2) don't make much sense here, as the images from the 'less obvious sources' might be thumbnails of different sizes, or the same size but re-encoded (not always a good idea, but it happens in practice).

Would a fuzzy hash, like the output from a known and fixed classifier, work for that?
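To make the idea of a fuzzy hash concrete, here is a minimal sketch of an average hash, a simple perceptual hash. This is not the classifier-based approach, just a much simpler stand-in to show the principle. It assumes the image has already been decoded into rows of grayscale intensities; the names `average_hash` and `hamming_distance` are hypothetical.

```python
from typing import Sequence


def average_hash(pixels: Sequence[Sequence[float]], hash_size: int = 8) -> int:
    """Average hash of a grayscale image given as rows of intensities.

    The image is reduced to hash_size x hash_size cells by block averaging;
    each cell contributes one bit: 1 if brighter than the mean cell, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for by in range(hash_size):
        y0 = by * height // hash_size
        y1 = max((by + 1) * height // hash_size, y0 + 1)
        for bx in range(hash_size):
            x0 = bx * width // hash_size
            x1 = max((bx + 1) * width // hash_size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Synthetic 16x16 gradient and a 2x upscaled copy of it.
    small = [[x + y for x in range(16)] for y in range(16)]
    big = [[small[y // 2][x // 2] for x in range(32)] for y in range(32)]
    print(hamming_distance(average_hash(small), average_hash(big)))
```

Two copies would be treated as the same image when the Hamming distance falls under some threshold, which has to be tuned; a hash this simple can also collide on genuinely different images.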


ramshorns 3552 days ago

For these images, a lot of the legal issues would be simpler and fair use wouldn't be necessary, since they're all CC-BY.


Smerity 3552 days ago

Not necessarily - they're listed as CC-BY on Flickr, but that doesn't make it true. Specifically, the user may not have the authority to grant such a license (and I assume that in any case where the user "accidentally" set it to CC-BY, they can't retract it?)

Google included specific legalese on this very issue. See https://news.ycombinator.com/item?id=12616012
