by drdeca
667 days ago
CLIP is just for an embedding for images and text, right? I might be getting mixed up…
The diffusion part is just trained on the images, and the guidance part… is trained to produce the image when given the additional information of the text embedding? I find it difficult to imagine how the CLIP embedding of the text could carry much information about the images CLIP was trained on into the generated images.
Imagine working with an artist in a multi-step refinement process to produce some desired artwork. Regardless of the artist's skill, you'll probably get better results if you're able to communicate well.
That's kinda how the diffusion process works. It starts with noise, generates a rough output, then iteratively refines it. The classifier is part of that refinement loop: at each step it signals what to change.
"Hey, you've added a tree-looking-thing on your beach-looking-thing, you should add some palm fronds so it better fits the setting."
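If it helps, here's a toy sketch of that loop in plain Python. To be clear, this is not how CLIP or any real diffusion model is implemented: the "image" is just a short list of numbers, and `guidance_grad`, `guided_refine`, and `distance` are hypothetical names I made up. The stand-in "classifier" simply points from the current state toward a known target, which is an assumption for illustration only. It only shows the shape of the idea: start from noise, then repeatedly nudge the state in the direction the guidance signal suggests while the injected noise shrinks.

```python
import random


def guidance_grad(x, target):
    # Hypothetical stand-in for the classifier/text-embedding signal:
    # the gradient of -0.5 * ||x - target||^2, which points toward the target.
    return [t - xi for xi, t in zip(x, target)]


def guided_refine(target, steps=50, guidance_scale=0.2, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        # Injected noise shrinks linearly and reaches zero on the last step.
        noise_level = 1.0 - (step + 1) / steps
        grad = guidance_grad(x, target)
        # Nudge toward what the guidance suggests, plus a little fresh noise.
        x = [
            xi + guidance_scale * g + 0.05 * noise_level * rng.gauss(0.0, 1.0)
            for xi, g in zip(x, grad)
        ]
    return x


def distance(a, b):
    # Euclidean distance between two equal-length lists.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


if __name__ == "__main__":
    target = [3.0, -2.0, 5.0]
    guided = guided_refine(target)
    unguided = guided_refine(target, guidance_scale=0.0)
    print("guided distance to target:  ", distance(guided, target))
    print("unguided distance to target:", distance(unguided, target))
```

With `guidance_scale=0.0` nobody is "communicating with the artist", so the state just wanders around its starting noise; with guidance on, each step pulls it closer to what was asked for.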