by rom1504
1280 days ago
Looks like you missed the whole point of this dataset. The idea we proved is that you can get a dataset with decent captions and images (which do match, yes; you can see for yourself at https://rom1504.github.io/clip-retrieval/ ) that can be used to train well-performing models (e.g. OpenCLIP and Stable Diffusion) while using only automated filtering of a noisy source (Common Crawl). We further proved that idea by using aesthetic prediction, NSFW and watermark tags to select the best pictures. Is it possible to write captions manually? Sure, but that doesn't scale much and won't make it possible to train general models.
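
To make the automated filtering idea concrete, here is a rough sketch of what selecting samples by score could look like. This is not the actual pipeline code: the `Sample` fields, the `keep` and `filter_samples` helpers, and every threshold value are hypothetical, chosen only for illustration. It assumes each image-caption pair already carries an image-text similarity score, an aesthetic score, and NSFW and watermark probabilities from upstream models.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical per-pair metadata; field names are assumptions.
    url: str
    caption: str
    clip_similarity: float  # image-text similarity score
    aesthetic_score: float  # predicted aesthetic rating
    p_nsfw: float           # predicted probability the image is NSFW
    p_watermark: float      # predicted probability the image is watermarked


def keep(sample, min_similarity=0.3, min_aesthetic=5.0,
         max_nsfw=0.1, max_watermark=0.5):
    """Return True if the sample passes every threshold.

    All default thresholds are illustrative placeholders.
    """
    return (
        sample.clip_similarity >= min_similarity
        and sample.aesthetic_score >= min_aesthetic
        and sample.p_nsfw <= max_nsfw
        and sample.p_watermark <= max_watermark
    )


def filter_samples(samples, **thresholds):
    """Keep only the samples that pass the automated checks."""
    return [s for s in samples if keep(s, **thresholds)]
```

No human looks at any individual pair here; the only manual work is choosing the thresholds, which is why this kind of filtering scales where hand-written captions don't.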