by simonw
20 days ago
That's more a test of a vision LLM's ability to correctly identify and count things in an image than it is of mathematical reasoning. If you look at the working of your non-Python example it gets most of the counts wrong - identifying shop A as two full notebooks plus one half notebook when it's actually three full notebooks, for example. The numeric answers it then gives would be correct if it hadn't made those vision mistakes. I've been testing vision LLMs on counting the number of pelicans in a photo for a while, and they're very unreliable at that. The best I've seen is Google Gemini 2.5 if you have it output image segmentation masks (a feature they have not included in the Gemini 3 series yet): https://simonwillison.net/2025/Apr/18/gemini-image-segmentat... - but that requires additional harness engineering: you need to explicitly cause it to use its image segmentation mechanism.
†: think "bo's'n"