by simonw 337 days ago
> I always thought tasks like this are usually just handed to another (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision.

Most vision LLMs don't actually use a separate vision model. https://huggingface.co/blog/vlms is a decent explanation of what's going on.

Most of the big LLMs these days are vision LLMs - the Claude models, the OpenAI models, Grok and most of the Gemini models all accept images in addition to text. To my knowledge none of them are using tool calling to a separate vision model for this.

Some of the local models can do this too - Mistral Small and Gemma 3 are two examples. You can tell they're not tool calling to anything because they run directly out of a single model weights file.

1 comment

Not a contradiction to anything you said, but o3 will sometimes whip up a Python script to analyse the pictures I give it.

For instance, I asked it to compute the symmetry group of a pattern I found on a wallpaper in a Lebanese restaurant this weekend. It realised it was unsure of the symmetries, so it used a Python script to rotate and mirror the pattern and compare each result to the original, checking the symmetries it suspected. Pretty awesome!
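
To give a rough idea of what a rotate-mirror-and-compare check can look like, here is a minimal sketch. It is not the script o3 wrote (I don't have that), and it makes some big simplifying assumptions: the pattern is a small square grid of numbers rather than a photo, and only the eight rotations and reflections of a square are tested, not the translations and glide reflections a full wallpaper group would need. The function names and the example grids are made up for illustration.

```python
# Toy sketch: find which rotations/reflections leave a square grid unchanged.
# All names and the example patterns are hypothetical illustrations.

def rotate90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def mirror(grid):
    """Reflect a grid left-to-right."""
    return [list(row[::-1]) for row in grid]

def symmetries(grid):
    """Return names of the transformations that map the grid onto itself."""
    original = [list(row) for row in grid]
    found = []
    for flipped in (False, True):
        # Start from the original, or from its mirror image.
        current = mirror(original) if flipped else original
        for quarter_turns in range(4):
            if current == original:
                prefix = "mirror+" if flipped else ""
                found.append(f"{prefix}rot{90 * quarter_turns}")
            current = rotate90(current)
    return found

# An X-shaped pattern: expected to have every symmetry of the square.
x_pattern = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

# A single diagonal stripe: fewer symmetries survive.
diagonal = [
    [1, 0],
    [0, 1],
]

print(symmetries(x_pattern))
print(symmetries(diagonal))
```

With a real photo you would need some tolerance in the comparison instead of exact equality, since pixels never match perfectly after a rotation or flip.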