by orbital-decay 416 days ago
Does it beat them because it's a transformer, or because it's a much larger end-to-end model with higher-quality multimodal training?

I wonder if it benefits because it can attend to individual tokens of the prompt while generating, compared to typical diffusion models that just get a static vector embedding of the prompt.
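
To make that distinction concrete, here is a toy sketch in plain Python. It is not any real model's architecture: the embeddings, queries, and function names (`pooled_condition`, `attend`) are all made up for illustration. It only contrasts conditioning on one pooled prompt vector with attending over the individual prompt tokens at each generation step.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pooled_condition(token_embs):
    # Static conditioning: mean-pool all prompt tokens into one vector.
    n = len(token_embs)
    return [sum(col) / n for col in zip(*token_embs)]

def attend(query, token_embs):
    # Per-step conditioning: scaled dot-product attention over prompt tokens.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in token_embs])
    context = [
        sum(w * t[i] for w, t in zip(weights, token_embs))
        for i in range(len(query))
    ]
    return weights, context

# Hypothetical 2-d embeddings for a three-token prompt.
prompt = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]

static = pooled_condition(prompt)        # identical at every generation step
w_a, ctx_a = attend([3.0, 0.0], prompt)  # a step whose query aligns with token 0
w_b, ctx_b = attend([0.0, 3.0], prompt)  # a step whose query aligns with token 1

print("pooled:", static)
print("step A weights:", w_a)
print("step B weights:", w_b)
```

In this sketch the pooled vector is the same no matter what is being generated, while the two attention steps put most of their weight on different prompt tokens and so see different context vectors.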