Hacker News new | ask | show | jobs
by drdeca 667 days ago
CLIP is just for an embedding for images and text, right?

I might be getting mixed up… The diffusion part is just trained with the images, and the guidance part… is trained to produce the image when given the additional information of the embedding of the text? I find it difficult to imagine how the information from the CLIP embedding of the text could result in much information about the images that CLIP was trained with, ending up in the generated images?

3 comments

An understanding of language is important for conveying and achieving intent.

Imagine working with an artist in a multi-step refinement process to produce some desired artwork. Regardless of the artists skill, you'll probably get better results if you're able to communicate well.

That's kinda how the diffusion process works. It starts with noise, generates a rough output, then iteratively refines it. The classifier is part of the refinement process so it knows what to change.

"Hey, you've added a tree-looking-thing on your beach-looking-thing, you should add some palm fronds so it better fits the setting."

> CLIP is just for an embedding for images and text, right?

Yes, which is what makes text-to-image generation possible. You can go ahead and try using Stable Diffusion models, or even the incredibly high quality Flux, with no text "embedding" (or whatever you want to call it), and judge for yourself if those outputs are useful.

I get that, but my question is, “how can using the guidance from CLIP possibly make the resulting image infringe on copyright?”. I’m not saying that the CLIP part isn’t necessary for it to be useful.
The diffusion process is conditioned on CLIP text, which works better (in theory) since the encoded text is aligned with images.