Hacker News
by jonplackett 414 days ago
The reason image-1 is so good is that the same model does both the talking and the image making.

I wonder if the same would be true for a multi-modal diffusion model that can now also speak?

1 comment

Facebook has their Chameleon model from 2024 that was in this space. Ancient now.

There is also this GitHub project that I played with a while ago that's trying to do this. https://github.com/GAIR-NLP/anole

Are there any OSS models that follow this approach today? Or are we waiting for somebody to hack that together?