by dnth 1206 days ago
The fastdup authors ran an analysis on LAION 400M and ImageNet21K. Here's what they found.

LAION 400M

> 60M duplicates.
> 962K broken images.
> Various label discrepancies.

ImageNet21K

> 1.2M duplicate images.
> 104K train/val leak.

fastdup GitHub repo - https://github.com/visual-layer/fastdup