I wonder if the same would be true for a multi-modal diffusion model that can now also speak?
There's also this GitHub project I played with a while ago that's trying to do this: https://github.com/GAIR-NLP/anole
Are there any OSS models that follow this approach today? Or are we waiting for somebody to hack that together?