Hacker News
Dedup-ing LAION (60M duplicates) and ImageNet (1.2M duplicates) with fastdup (youtube.com)
3 points by dnth 1207 days ago
1 comment

The fastdup authors ran an analysis on LAION 400M and ImageNet21K. Here's what they found.

LAION 400M

> 60M duplicates.
> 962K broken images.
> Various label discrepancies.

ImageNet21K

> 1.2M duplicate images.
> 104K train/val leak.
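
For anyone unsure what "duplicates" and "train/val leak" mean in practice, here is a minimal sketch using only the Python standard library. It is not fastdup's method or API (the post doesn't describe how fastdup detects them); it is a swapped-in, much simpler technique that hashes raw file bytes, so it only catches byte-identical copies and would miss resized or re-encoded near-duplicates. The function names (`file_digest`, `exact_duplicates`, `train_val_leak`) are made up for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def group_by_digest(paths):
    """Map each digest to the list of files that share it."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(Path(p))
    return groups


def exact_duplicates(paths):
    """Return groups of two or more byte-identical files."""
    return [g for g in group_by_digest(paths).values() if len(g) > 1]


def train_val_leak(train_paths, val_paths):
    """Return validation files whose bytes also appear in the training set."""
    train_digests = set(group_by_digest(train_paths))
    return [Path(p) for p in val_paths if file_digest(p) in train_digests]
```

A leaked validation image is one the model may already have seen during training, so accuracy measured on it can be overly optimistic. On a real dataset you would pass in lists of image paths (for example from `Path("train").rglob("*.jpg")`, assuming that directory layout) and expect a perceptual or embedding-based tool to find far more than this exact-match check does.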

fastdup GitHub repo - https://github.com/visual-layer/fastdup