by coder543 901 days ago
Being multimodal doesn’t seem to require much of a size penalty: https://github.com/dlyuangod/TinyGPT-V

Even so, Google treats Gemini Pro Vision as a separate model from Gemini Pro, so it could have separate parameters dedicated to vision (as CogVLM does), and those wouldn’t affect the size of the model as far as text tasks are concerned.
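
To make the "separate vision parameters" point concrete, here is a rough toy sketch, not a description of how Gemini or CogVLM is actually built. It assumes each layer has a block of weights used for every token plus an extra block used only for image tokens. `ToyLayer`, `active_param_count`, and all the parameter counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyLayer:
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    text_params: int    # weights applied to every token
    vision_params: int  # extra weights applied only when image tokens are present

    def active_params(self, has_image: bool) -> int:
        # Vision-only weights are skipped entirely for text-only inputs.
        return self.text_params + (self.vision_params if has_image else 0)


def active_param_count(layers: list[ToyLayer], has_image: bool) -> int:
    """Total parameters that take part in a forward pass."""
    return sum(layer.active_params(has_image) for layer in layers)


# Four identical toy layers (assumed sizes, not real model numbers).
layers = [ToyLayer(text_params=1_000, vision_params=400) for _ in range(4)]

text_only = active_param_count(layers, has_image=False)
with_image = active_param_count(layers, has_image=True)

print(f"text-only pass:  {text_only} params")
print(f"with image pass: {with_image} params")
```

Under these assumptions the text-only pass uses the same number of parameters it would in a model with no vision weights at all, which is the sense in which the vision parameters don't count toward the size for text tasks.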