Hacker News
by DarkLinkXXXX 3550 days ago
Agreed. Do you think maybe these datasets should include checksums so that they might be recovered with more certainty, from less obvious sources?

P.S.: Also, archive.org retroactively removes (or delists?) site content when blocked by robots.txt, as domain parkers often do, so I could see that causing trouble down the line. https://archive.org/post/1019415/retroactive-robotstxt-remov...

2 comments

I have added OriginalSize and OriginalMD5 columns to images.csv in the dataset: https://github.com/openimages/dataset/commit/526555b43ab74b6...
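For anyone who wants to check a recovered copy against those columns, here is a minimal sketch. It assumes you already have the image bytes plus the OriginalSize and OriginalMD5 values for that row. The function name `matches_record` is made up for illustration, and since the digest encoding isn't stated here, it accepts either a hex or a base64 digest.

```python
import base64
import hashlib


def matches_record(data: bytes, original_size: int, original_md5: str) -> bool:
    """Check raw image bytes against a recorded size and MD5 digest.

    Assumption: the recorded digest is either hex or base64 text;
    both spellings of the same digest are accepted.
    """
    if len(data) != original_size:
        return False
    digest = hashlib.md5(data).digest()
    recorded = original_md5.strip()
    as_hex = digest.hex()
    as_b64 = base64.b64encode(digest).decode("ascii")
    return recorded.lower() == as_hex or recorded == as_b64
```

Checking the size first is cheap and rejects most mismatches (thumbnails, truncated downloads) before any hashing happens.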

I'm still thinking about a soft hash.

Strict hash sums (like MD5, SHA-1, or SHA-2) don't make much sense here, since the images from the 'less obvious sources' might be thumbnails of different sizes. Or perhaps the same size but re-encoded (not always a good idea, but it happens in practice).
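To make the thumbnail point concrete, here is a rough sketch of one possible soft hash, the "average hash" (aHash). It is just one well-known perceptual hash, not something anyone in this thread has settled on. It assumes the image is already decoded to a 2D list of grayscale values, and the names `average_hash` and `hamming` are made up for illustration.

```python
def average_hash(pixels, hash_size=8):
    """Return a hash_size*hash_size-bit integer for a grayscale image.

    pixels: 2D list (rows of numbers). The image is box-averaged down
    to a hash_size x hash_size grid, and each cell contributes one bit:
    1 if it is brighter than the mean of all cells, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(hash_size):
        y0 = r * height // hash_size
        y1 = max((r + 1) * height // hash_size, y0 + 1)
        for c in range(hash_size):
            x0 = c * width // hash_size
            x1 = max((c + 1) * width // hash_size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Two copies of the same picture at different resolutions should land at a small Hamming distance, where MD5 would simply differ. The threshold for "same image" is a judgment call and would need tuning on real data.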

Would a fuzzy hash, like the output from a known and fixed classifier, work for that?