by simandl 1372 days ago
This is LAION-5B; you can read more about it here: https://laion.ai/blog/laion-5b/

Imagen and Stable Diffusion both used subsets of this full 5.8B-image set.

1 comments

Is Imagen actually trained on a subset of Laion-5B and nothing else? I've heard they used huge internal data sets.
They have their own datasets and also included LAION-400M, a subset of LAION-5B that was released before the full set. You can see a short explanation in Imagen's "Limitations and Societal Impact" section at: https://imagen.research.google/.

> While a subset of our training data was filtered to removed noise and undesirable content, such as pornographic imagery and toxic language, we also utilized LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.