by fpgaminer
980 days ago
The architecture is quite compelling. I would not have expected it to work as well as it does. Glancing at the benchmarks, it's basically on par with other VLMs in its class, despite having no separate image encoder.

Is there an associated paper? Or more specifically, details on the training dataset? It must have been a mix of text and VLM tasks, otherwise one or the other capability would have rotted during training. But I wonder if they trained strictly on VLM corpora, or also used plain image-text pair datasets like the ones used to train CLIP. It would be interesting if only the former.

It also makes me wonder if it could be trained on something like CommonCrawl with all the images retained and interspersed correctly throughout the text. This model could theoretically train just fine on that, which would effectively unlock a whole new dataset.

And has anyone inspected what the model outputs for predicted image "tokens"? Is it correctly predicting projected image patches to any degree of accuracy? If so, could it also generate images inline with text if another de-projection layer were trained?
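To make the "de-projection" idea concrete, here is a rough NumPy sketch of the shape plumbing only. Everything in it is an assumption for illustration: the patch size, the model width, the random `W_in` matrix, and the helper names `patchify` / `unpatchify` are all hypothetical and not taken from the model. It also inverts the input projection directly with a pseudo-inverse, whereas a real de-projection layer would have to be trained on the model's output hidden states.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, c)
    return x.reshape(rows * cols, patch * patch * c)

def unpatchify(patches, patch, h, w, c):
    """Inverse of patchify: rebuild the (H, W, C) image from flat patches."""
    rows, cols = h // patch, w // patch
    x = patches.reshape(rows, cols, patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, patch, cols, patch, c)
    return x.reshape(h, w, c)

rng = np.random.default_rng(0)
patch, d_model = 4, 64            # hypothetical sizes, chosen for the demo
image = rng.random((32, 32, 3))   # stand-in for a real image

# Input side: each flattened patch is linearly projected to model width.
patches = patchify(image, patch)                    # (64, 48)
W_in = rng.normal(size=(patches.shape[1], d_model)) * 0.1
embeddings = patches @ W_in                         # (64, 64)

# Output side (assumption): a linear "de-projection" back to pixel space.
# Here it is just the pseudo-inverse of W_in; a trained layer would instead
# be fit to whatever the model predicts at image positions.
W_out = np.linalg.pinv(W_in)                        # (64, 48)
recon = embeddings @ W_out
recon_image = unpatchify(recon, patch, 32, 32, 3)

error = float(np.mean((recon_image - image) ** 2))
print(patches.shape, embeddings.shape, error)
```

The round trip is lossless here only because the projection is linear and `d_model` is at least as large as the flattened patch size. Whether the model's predicted hidden states at image positions carry enough information to do the same is exactly the open question.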
This seemed a bit surreal to me, like trying to train an LLM on the outputs of a smaller, worse-performing LLM.
[0] https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md#...