by fpgaminer 980 days ago
The architecture is quite compelling. I would not have expected it to work as well as it does. Glancing at the benchmarks it's basically on par with other VLMs in its class, despite having no separate image encoder.

Is there an associated paper? Or, more specifically, details on the training dataset? It must have been a mix of text and VLM tasks, otherwise one or the other capability would have rotted during training. But I wonder if they trained strictly on VLM corpora, or also used plain image-text pair datasets like the ones CLIP was trained on. It would be interesting if only the former.

Also makes me wonder if it could be trained on something like CommonCrawl where all the images are retained and interspersed correctly throughout the text. This model could theoretically train just fine off that, and it would unlock a whole new dataset effectively.
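
To make the interleaving idea concrete, here is a minimal sketch of how a crawled page with inline images might be flattened into one training sequence. Everything in it is an assumption for illustration: the `Text` and `Image` types, the `flatten_document` helper, and the `image_newline` row marker are hypothetical names, not anything confirmed about Fuyu's actual data pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class Text:
    content: str


@dataclass
class Image:
    rows: int  # number of patch rows (hypothetical, set by image height)
    cols: int  # number of patch columns (hypothetical, set by image width)


Segment = Union[Text, Image]
Item = Tuple[str, Optional[object]]


def flatten_document(segments: List[Segment]) -> List[Item]:
    """Flatten text and image segments into one sequence, preserving order.

    Text becomes ("text", word) items. Each image becomes ("patch", (row, col))
    items in raster order, with a hypothetical ("image_newline", None) marker
    after every patch row so the image width can be recovered.
    """
    sequence: List[Item] = []
    for seg in segments:
        if isinstance(seg, Text):
            sequence.extend(("text", word) for word in seg.content.split())
        else:
            for r in range(seg.rows):
                sequence.extend(("patch", (r, c)) for c in range(seg.cols))
                sequence.append(("image_newline", None))
    return sequence


# A toy "web page": text, then an inline image of 1 x 2 patches, then more text.
page = [Text("A cat"), Image(rows=1, cols=2), Text("sat")]
sequence = flatten_document(page)
```

The point of the sketch is only that images stay at their original position in the text stream instead of being split off into separate caption pairs.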

And has there been an inspection of what the model is outputting for predicted image "tokens"? Is it predicting the projected image patches to any degree of accuracy? If so, could it also generate images inline with text if another de-projection layer were trained?
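
A rough sketch of what that inspection could look like, under stated assumptions: image patches are flattened and passed through a single linear projection into the model's embedding space, and a second linear "de-projection" maps vectors back to pixel space. The weights here are random and untrained, the transformer itself is skipped, and `patchify`, `linear`, `w_in`, and `w_out` are hypothetical names, so this only illustrates the shapes involved and how a per-patch error could be measured.

```python
import random


def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened patches, raster order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return patches


def linear(vec, weight):
    """Multiply a vector of length n_in by a weight matrix of shape n_in x n_out."""
    n_out = len(weight[0])
    return [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(n_out)]


def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
PATCH, D_MODEL = 2, 8  # toy sizes, chosen only for the example

# A 4 x 4 grayscale "image" with pixel values 0..15.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(image, PATCH)

w_in = random_matrix(PATCH * PATCH, D_MODEL, rng)   # patch pixels -> embedding
w_out = random_matrix(D_MODEL, PATCH * PATCH, rng)  # embedding -> patch pixels

embeddings = [linear(p, w_in) for p in patches]
# In a real model, the transformer's output at each image position would go here.
reconstructed = [linear(e, w_out) for e in embeddings]

# Per-patch reconstruction error: the quantity one would inspect.
errors = [mse(p, r) for p, r in zip(patches, reconstructed)]
```

With trained weights, low per-patch error on held-out images would suggest the model really is predicting patch content, which is the precondition for generating images inline.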

2 comments

I too would like to know about the training dataset. I just took a look at the one for LLaVA[0] and found that they used a fairly large number of BLIP auto-generated captions.

This seemed a bit surreal to me, like trying to train an LLM on the outputs of a smaller, worse-performing LLM.

[0] https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md#...

This is the first open-source multimodal model I have heard of. Are there other alternatives already?
The Fuyu pre-trained model is not open source. At best, it is source-available. It's also not the only multimodal model you can run locally.

A few other examples include LLaVA[0], IDEFICS[1][2], and CogVLM[3]. MiniGPT-4[4] might be another one to look at. I'm pretty sure all of these have better licenses than Fuyu. Fuyu's architecture does sound really interesting, but the license on the pre-trained model is a complete non-starter for almost anything.

[0]: https://github.com/haotian-liu/LLaVA

[1]: https://huggingface.co/blog/idefics

[2]: https://huggingface.co/HuggingFaceM4/idefics-80b-instruct

[3]: https://github.com/THUDM/CogVLM

[4]: https://github.com/Vision-CAIR/MiniGPT-4