by sly010
336 days ago
Genuine question: How does this work? How does an LLM do object detection? Or more generally, how does an LLM do anything that is not text? I always thought tasks like this were just handed to another (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision. It doesn't make sense to me why Gemini 2 and 2.5 would have different vision capabilities. Shouldn't they both have access to the same purpose-trained, state-of-the-art vision model?
Different models have different encoders. The encoders are not shared, because the training datasets vary across models and even across model sizes, so performance will vary between models.
What you seem to be thinking is that text models simply call an API to a vision model, similar to tool use. That is not what's happening. It is much more built in: the forward pass goes through the vision architecture and straight into the language architecture. Robotics research has been doing this for a while.
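To make "the forward pass goes through the vision architecture into the language architecture" concrete, here is a rough toy sketch in plain Python. It follows the pattern used by open vision-language models such as LLaVA: a vision encoder turns image patches into vectors, a projection maps them to the language model's embedding width, and they are placed in the same input sequence as the text embeddings. Gemini's internals are not public, so this is not a description of Gemini. Every name, size, and weight here (`encode_image`, `W_proj`, `D_MODEL`, the random matrices) is made up for illustration.

```python
import random

random.seed(0)

# Hypothetical toy sizes; real models use thousands of dimensions.
PATCH_SIZE = 2   # each patch is 2x2 pixels
D_VISION = 4     # width of the vision encoder's output
D_MODEL = 6      # width of the language model's token embeddings


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Stand-in for a vision encoder: one linear layer over flattened patches.
W_patch = rand_matrix(PATCH_SIZE * PATCH_SIZE, D_VISION)
# Projection from vision width to language-model width.
W_proj = rand_matrix(D_VISION, D_MODEL)
# Stand-in text embedding table.
VOCAB = ["where", "is", "the", "cat"]
EMBED = {tok: [random.uniform(-1, 1) for _ in range(D_MODEL)] for tok in VOCAB}


def encode_image(image):
    """Cut the image into non-overlapping patches and embed each one."""
    patches = []
    for r in range(0, len(image), PATCH_SIZE):
        for c in range(0, len(image[0]), PATCH_SIZE):
            patches.append([image[r + dr][c + dc]
                            for dr in range(PATCH_SIZE)
                            for dc in range(PATCH_SIZE)])
    return matmul(patches, W_patch)


def build_input_sequence(image, tokens):
    """Image vectors and text vectors end up in ONE sequence for the LM."""
    image_vectors = matmul(encode_image(image), W_proj)
    text_vectors = [EMBED[t] for t in tokens]
    return image_vectors + text_vectors


# A 4x4 grayscale "image" gives 4 patches; the prompt adds 4 text tokens.
image = [[0.1 * (r + c) for c in range(4)] for r in range(4)]
sequence = build_input_sequence(image, VOCAB)
print(len(sequence), len(sequence[0]))
```

The language model's transformer layers (left out here) would then attend over that single sequence. There is no API call in between, and the encoder and projection are trained for a specific model, which is one reason vision quality can differ between model versions.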