by parsimo2010 413 days ago
This sounds like a neat idea, but it seems like bad timing. OpenAI just released token-based image generation that beats the best diffusion image generation. If diffusion isn't even the best at generating images, I don't know if I'm going to spend a lot of time evaluating it for text.

Speed is great, but it doesn't seem like other trends in text models, like reasoning, are going to work out of the box. So you have to get dLLMs up to the quality of a regular autoregressive LLM, and then you need to innovate more to catch up to reasoning models, just to match the current state of the art. It's possible they'll get there, but I'm not optimistic.

2 comments

The reason image-1 is so good is that it’s the same model doing the talking and the image making.

I wonder if the same would be true for a multi-modal diffusion model that can now also speak?

Facebook has their Chameleon model from 2023 that was in this space. Ancient now.

There is also this GitHub project that I played with a while ago that's trying to do this. https://github.com/GAIR-NLP/anole

Are there any OSS models that follow this approach today? Or are we waiting for somebody to hack that together?

Does it beat them because it's a transformer, or because it's a much larger end-to-end model with higher quality multimodal training?
I wonder if it benefits because it can attend to individual tokens of the prompt while generating, compared to typical diffusion models that just get a static vector embedding of the prompt.
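
To make that distinction concrete, here is a toy sketch in pure Python. It is not how any particular model is implemented: the 4-dimensional "embeddings", the two queries, and the function names (`pooled_conditioning`, `attention_conditioning`) are all made up for illustration. It only shows the difference between conditioning on one pooled prompt vector and letting each generation step attend over the individual prompt tokens.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pooled_conditioning(prompt_tokens):
    # Mean-pool the prompt into a single static vector.
    # Every generation step sees this same vector.
    n = len(prompt_tokens)
    dim = len(prompt_tokens[0])
    return [sum(tok[d] for tok in prompt_tokens) / n for d in range(dim)]

def attention_conditioning(query, prompt_tokens):
    # Scaled dot-product attention of one query over the prompt tokens.
    # Returns the attention weights and the weighted mix of token vectors.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in prompt_tokens])
    dim = len(prompt_tokens[0])
    context = [
        sum(w * tok[d] for w, tok in zip(weights, prompt_tokens))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical one-hot "embeddings" for a four-token prompt:
# "red", "cube", "blue", "sphere".
prompt_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Two hypothetical generation steps with different queries.
query_a = [4.0, 0.0, 0.0, 0.0]  # a step that "cares about" the first token
query_b = [0.0, 0.0, 0.0, 4.0]  # a step that "cares about" the last token

weights_a, context_a = attention_conditioning(query_a, prompt_tokens)
weights_b, context_b = attention_conditioning(query_b, prompt_tokens)
static = pooled_conditioning(prompt_tokens)

print("step A weights:", [round(w, 3) for w in weights_a])
print("step B weights:", [round(w, 3) for w in weights_b])
print("pooled vector: ", static)
```

With per-token attention, step A puts most of its weight on the first prompt token and step B on the last, while the pooled vector is identical for both steps. Whether that actually explains the quality gap is still an open question here.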