by simandl 1371 days ago
This is LAION-5B: https://laion.ai/blog/laion-5b/

It's built off of Common Crawl, so it probably does have a pretty representative sample of whatever the big image searches use.

Funnily enough, the NSFW filter that LAION built is turned on. Without it, it's... a lot. The NSFW detection is done with a model, so you get a probability that each image is NSFW, and you can pick a threshold for filtering.

If you set the threshold high, say filtering an image only when the model is 95% certain it's NSFW, you get a bunch of false negatives and a ton of NSFW gets through. Set it too low, and you throw out stuff that isn't NSFW.
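
To make the threshold trade-off concrete, here's a rough sketch in Python. The scores and labels are made up for illustration (they are not real LAION data), and `filter_errors` is a hypothetical helper, not anything from LAION's or our actual code. It just counts the two kinds of mistakes at a given threshold.

```python
# Hypothetical (nsfw_probability, actually_nsfw) pairs; invented for illustration.
SAMPLES = [
    (0.99, True),
    (0.97, True),
    (0.80, True),
    (0.60, True),
    (0.55, False),
    (0.30, False),
    (0.10, False),
    (0.02, False),
]


def filter_errors(samples, threshold):
    """Filter out images with probability >= threshold and count the errors.

    Returns (false_negatives, false_positives):
      false_negatives: NSFW images that slip through the filter
      false_positives: non-NSFW images that get thrown out
    """
    false_negatives = sum(1 for p, nsfw in samples if nsfw and p < threshold)
    false_positives = sum(1 for p, nsfw in samples if not nsfw and p >= threshold)
    return false_negatives, false_positives


if __name__ == "__main__":
    for threshold in (0.95, 0.50, 0.05):
        fn, fp = filter_errors(SAMPLES, threshold)
        print(f"threshold={threshold:.2f}: {fn} NSFW let through, {fp} safe images dropped")
```

On this toy data, the 0.95 threshold lets two NSFW images through and drops nothing safe, while 0.05 catches all the NSFW but drops three safe images.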

We (haveibeentrained) erred on the side of too high, so that we wouldn't tell artists their work wasn't in there when it was. It's a tough trade-off. It's similar to using the dataset to train an AI model, where you might cut useful images out of training if you try to filter all the NSFW.