by astrange
969 days ago
The image model knows what images look like even without prompts, and if you train it on a trillion images it will learn a latent space where similar pictures have similar embeddings. Inaccurate captions on some of those images may mean the text encoder can't get you to their embeddings, but they're still in there. In other words, text prompting is a bad way to drive an image-generating model.
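To make that concrete, here is a toy sketch in plain Python, not how any real model works. Everything in it is an assumption for illustration: the 2-D embeddings, the `IMAGES` data, the `toy_text_encoder` that maps a caption to the centroid of the images carrying it, and the `nearest` lookup. Two similar heron pictures sit close together in the latent space but carry wrong captions, so no caption lands near them, while a lookup from one heron's embedding finds the other immediately.

```python
import math

# Hypothetical 2-D "latent space": similar pictures have similar embeddings.
# The two heron pictures are miscaptioned as "cat" and "dog".
IMAGES = [
    {"id": "cat_1", "emb": (1.0, 0.0), "caption": "cat"},
    {"id": "cat_2", "emb": (0.9, 0.1), "caption": "cat"},
    {"id": "dog_1", "emb": (0.0, 1.0), "caption": "dog"},
    {"id": "dog_2", "emb": (0.1, 0.9), "caption": "dog"},
    {"id": "heron_1", "emb": (-1.0, -1.0), "caption": "cat"},
    {"id": "heron_2", "emb": (-0.9, -1.0), "caption": "dog"},
]


def toy_text_encoder(images):
    """Map each caption to the centroid of the images that carry it.

    This is a stand-in for a learned text encoder, not a real one.
    """
    sums = {}
    for im in images:
        x, y, n = sums.get(im["caption"], (0.0, 0.0, 0))
        sums[im["caption"]] = (x + im["emb"][0], y + im["emb"][1], n + 1)
    return {cap: (x / n, y / n) for cap, (x, y, n) in sums.items()}


def nearest(query, images, exclude=None):
    """Return the id of the image whose embedding is closest to `query`."""
    candidates = [im for im in images if im["id"] != exclude]
    return min(candidates, key=lambda im: math.dist(query, im["emb"]))["id"]


encoder = toy_text_encoder(IMAGES)

# Driving the lookup with text: each caption only reaches its own cluster.
text_hits = {cap: nearest(vec, IMAGES) for cap, vec in encoder.items()}
print(text_hits)  # {'cat': 'cat_2', 'dog': 'dog_2'}

# Driving the lookup with an image embedding: the heron cluster is reachable.
image_hit = nearest(IMAGES[4]["emb"], IMAGES, exclude="heron_1")
print(image_hit)  # heron_2
```

The herons never show up in `text_hits` even though they form a tight cluster of their own, which is the sense in which the embeddings are "still in there" but out of reach of the prompt.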