Hacker News
by astrange 969 days ago
AI image models are almost entirely not trained on human-labeled data: StableDiffusion is trained on scraping nearby text on the page, DALLE3 uses synthetic captions from an image-to-text model, and Midjourney doesn't disclose what it does. You can't get humans to label a billion images.

One way you can tell this isn't true is that if you take an image model and prompt it with an image, or just surf through the latent space by changing the embeddings, you'll find absolutely everything in there, from non-stereotypical representations to indescribable things.
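For anyone who wants to try the "surf through the latent space" part, here is a minimal sketch of one common way to do it: spherical interpolation (slerp) between two embedding vectors, feeding each intermediate vector to the model. This is an illustration, not how any particular tool does it. The names `slerp` and `walk` are made up for this example, the embeddings are plain Python lists, and actually getting embeddings in and out of a model is left out.

```python
import math


def slerp(a, b, t):
    """Spherically interpolate between vectors a and b for 0 <= t <= 1."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Angle between the two vectors, computed from their unit directions.
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, dot)))
    s = math.sin(omega)
    if s < 1e-6:
        # Nearly parallel or antiparallel: fall back to a straight line.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def walk(a, b, steps):
    """Return `steps` embeddings evenly spaced from a to b (steps >= 2)."""
    return [slerp(a, b, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real ones have hundreds of dimensions.
    for point in walk([1.0, 0.0], [0.0, 1.0], 5):
        print([round(x, 3) for x in point])
```

Each vector returned by `walk` would be passed to the model in place of a prompt embedding; the images in between the two endpoints are where the unlabeled, hard-to-describe stuff shows up.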

1 comment

> StableDiffusion is trained on scraping nearby text on the page

And that nearby text was written by humans. It may not be an explicit label in an HTML attribute, but if the surrounding text weren't related to the image, the scraping wouldn't work.
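To make the scraping concrete, here is a rough sketch of pulling (image URL, text) pairs out of a page using only Python's standard `html.parser`. It assumes the "nearby text" is the `alt` attribute of each `img` tag, which as far as I know is what LAION-style datasets mostly rely on; real pipelines add language detection, deduplication, and similarity filtering that are not shown. `AltTextCollector` and `collect_pairs` are names invented for this example.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that have non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        # A valueless alt attribute is reported as None, so normalize to "".
        alt = (attr.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def collect_pairs(html):
    """Return the list of (src, alt) pairs found in an HTML string."""
    parser = AltTextCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    page = """
    <p>My cat asleep on the radiator.</p>
    <img src="cat.jpg" alt="A tabby cat sleeping on a radiator">
    <img src="spacer.gif" alt="">
    <img src="IMG_0042.jpg">
    """
    print(collect_pairs(page))
```

Nobody sat down to label `cat.jpg` for a training set, but a human still wrote that alt text, which is the point being made here.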

If you go looking in LAION, the captions are often complete garbage. I think people underestimate how bad they are. Aesthetic finetuning does somehow fix the results, but not by writing better captions.

(How does it work? Beats me.)