	by parsimo2010 413 days ago
	This sounds like a neat idea, but it seems like bad timing. OpenAI just released a token-based image generator that beats the best diffusion image generation. If diffusion isn't even the best at generating images, I don't know if I'm going to spend a lot of time evaluating it for text. Speed is great, but it doesn't seem like other text-model trends, like reasoning, are going to work out of the box. So you have to get dLLMs up to the quality of a regular autoregressive LLM, and then you need to innovate more to catch up to reasoning models, just to match the current state of the art. It's possible they'll get there, but I'm not optimistic.

jonplackett 413 days ago

The reason image-1 is so good is because it’s the same model doing the talking and the image making.

I wonder if the same would be true for a multi-modal diffusion model that can now also speak?

freeqaz 413 days ago

Facebook has their Chameleon model from 2023 that was in this space. Ancient now.

There is also this GitHub project that I played with a while ago that's trying to do this. https://github.com/GAIR-NLP/anole

Are there any OSS models that follow this approach today? Or are we waiting for somebody to hack that together?

orbital-decay 413 days ago

Does it beat them because it's a transformer, or because it's a much larger end-to-end model with higher quality multimodal training?

scratchyone 413 days ago

I wonder if it benefits because it can attend to individual tokens of the prompt while generating, compared to typical diffusion models that just get a static vector embedding of the prompt.
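
To make that distinction concrete, here is a minimal sketch in plain Python, not taken from any real model. It contrasts conditioning on one pooled prompt vector with scaled dot-product attention over the individual prompt tokens. The 2-d embeddings, the queries, and the helper names `pooled_condition` and `attend` are all made up for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pooled_condition(prompt_tokens):
    # Static conditioning: average every prompt token into a single vector.
    # The generator sees the same vector at every step.
    n = len(prompt_tokens)
    dim = len(prompt_tokens[0])
    return [sum(tok[i] for tok in prompt_tokens) / n for i in range(dim)]

def attend(query, prompt_tokens):
    # Per-token conditioning: scaled dot-product attention over the prompt.
    # The weights, and so the context vector, depend on the current query.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in prompt_tokens])
    dim = len(prompt_tokens[0])
    context = [
        sum(w * tok[i] for w, tok in zip(weights, prompt_tokens))
        for i in range(dim)
    ]
    return weights, context

# Hypothetical 2-d embeddings for a three-token prompt.
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pooled = pooled_condition(prompt)        # one fixed summary of the prompt
w_a, ctx_a = attend([4.0, 0.0], prompt)  # a query aligned with token 0
w_b, ctx_b = attend([0.0, 4.0], prompt)  # a query aligned with token 1

print("pooled:", pooled)
print("query A weights:", w_a)
print("query B weights:", w_b)
```

The pooled vector is identical no matter what is being generated, while the two queries put their weight on different prompt tokens and get different context vectors back. Whether that explains the quality gap in practice is a separate question that this toy example does not answer.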
