| HN Mirror



	by DarkLinkXXXX 3550 days ago
	Agreed. Do you think these datasets should include checksums, so that the images could be recovered with more certainty from less obvious sources? P.S.: archive.org also retroactively removes (or delists?) site content when it is blocked by robots.txt, as domain parkers often do, so I could see that causing trouble down the line. https://archive.org/post/1019415/retroactive-robotstxt-remov...

2 comments

krasin 3542 days ago

I have added OriginalSize and OriginalMD5 columns to images.csv in the dataset: https://github.com/openimages/dataset/commit/526555b43ab74b6...

Still thinking about a soft hash.
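For anyone wanting to use those two columns, here is a minimal sketch of checking a downloaded file against them. The function name `matches_original` is hypothetical, and the thread doesn't say how OriginalMD5 is encoded, so this assumes it is either a hex digest or a base64-encoded digest and accepts both.

```python
import base64
import hashlib


def matches_original(data: bytes, original_size: int, original_md5: str) -> bool:
    """Return True if `data` has the recorded byte size and MD5 digest."""
    # Cheap check first: a size mismatch rules the file out immediately.
    if len(data) != original_size:
        return False

    digest = hashlib.md5(data).digest()
    recorded = original_md5.strip()

    # Assumption: the column encoding is not stated in the thread, so compare
    # against both the hex form and the base64 form of the digest.
    hex_form = digest.hex()
    b64_form = base64.b64encode(digest).decode("ascii")
    return recorded.lower() == hex_form or recorded == b64_form
```

This only confirms a byte-for-byte identical copy. A resized or re-encoded copy will fail the check even when it shows the same picture.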


krasin 3550 days ago

Strict hashsums (like MD5 / SHA-1 / SHA-2) don't make much sense here, as the images from the 'less obvious sources' might be thumbnails of different sizes, or the same size but re-encoded (not always a good idea, but it happens in practice).

Would a fuzzy hash, like the output of a known and fixed classifier, work for that?
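To make the fuzzy hash idea concrete, here is a rough sketch of one simple option, an average hash. This is not the classifier-based hash suggested above, just a much simpler stand-in technique that tolerates resizing. It assumes the image has already been decoded to a grayscale grid (a list of rows of numbers), and the names `average_hash` and `hamming_distance` are illustrative.

```python
def average_hash(pixels, hash_size=8):
    """Average hash of a grayscale image given as a list of equal-length rows.

    The image is reduced to hash_size x hash_size cells by block averaging,
    and each cell contributes one bit: 1 if it is brighter than the mean cell.
    """
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")

    cells = []
    for i in range(hash_size):
        r0, r1 = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            c0, c1 = j * width // hash_size, (j + 1) * width // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))

    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Usage: a synthetic 32x32 gradient and a 16x16 thumbnail of it.
image = [[r + c for c in range(32)] for r in range(32)]
thumbnail = [
    [sum(image[2 * r + dr][2 * c + dc] for dr in (0, 1) for dc in (0, 1)) / 4
     for c in range(16)]
    for r in range(16)
]
distance = hamming_distance(average_hash(image), average_hash(thumbnail))
```

Two copies would be treated as the same image when the distance falls under some threshold, which would have to be tuned on real data. Unlike MD5, nothing here depends on the exact bytes of the file.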
