by visarga 1414 days ago
If you want multiple objects, each with its own attributes, the unCLIP model still has to squeeze the whole prompt into a single embedding vector. But a single fixed-size vector is too small to hold an ever more detailed scene description. That's why it has failure modes like assigning the wrong colour to cubes and not being able to spell text.

On the other hand, the previous approach, autoregressive generation, gives the model full access to the prompt through the attention mechanism.

Imagen, for example, encodes the text into a sequence of embeddings rather than a single vector.

> Imagen comprises a frozen T5-XXL [52] encoder to map input text into a sequence of embeddings and a 64×64 image diffusion model, followed by two super-resolution diffusion models
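
To make the single-vector vs. sequence point concrete, here's a toy sketch in plain Python. Everything in it is an assumption for illustration: the 4-dimensional "embeddings" are hand-made (colour one-hot plus shape one-hot), mean pooling is only a stand-in for collapsing a prompt into one vector (it is not how CLIP or unCLIP actually compute their embedding), and `cross_attend` is a bare single-head dot-product attention, not any real model's layer.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(tokens):
    """Collapse a sequence of embeddings into one vector (toy bottleneck)."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def cross_attend(query, tokens):
    """Single-head dot-product attention: the query reads a weighted mix of tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, t)) / scale for t in tokens]
    weights = softmax(scores)
    return [sum(w * t[i] for w, t in zip(weights, tokens))
            for i in range(len(tokens[0]))]

# Hypothetical embedding layout: [is_red, is_blue, is_cube, is_sphere]
prompt_a = [[1, 0, 1, 0], [0, 1, 0, 1]]  # "red cube, blue sphere"
prompt_b = [[0, 1, 1, 0], [1, 0, 0, 1]]  # "blue cube, red sphere"

# Hypothetical query from the image side asking "what is the cube like?"
cube_query = [0, 0, 10, 0]

print("pooled a:", mean_pool(prompt_a))
print("pooled b:", mean_pool(prompt_b))
print("attend a:", cross_attend(cube_query, prompt_a))
print("attend b:", cross_attend(cube_query, prompt_b))
```

In this toy the two pooled vectors come out identical, so which colour belongs to the cube is lost before generation even starts. Attending over the sequence keeps each object's attributes together: the same query gets a mostly-red answer for the first prompt and a mostly-blue one for the second.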