by simonw 20 days ago
That's more a test of a vision LLM's ability to correctly identify and count things in an image than it is of mathematical reasoning.

If you look at the working of your non-Python example it gets most of the counts wrong - identifying shop A as two full notebooks plus one half notebook when it's actually three full notebooks, for example. The numeric answers it then gives would be correct if it hadn't made those vision mistakes.

I've been testing vision LLMs on counting the number of pelicans in a photo for a while, and they're very unreliable at that.

The best I've seen is Google Gemini 2.5 if you have it output image segmentation masks (a feature they have not included in the Gemini 3 series yet): https://simonwillison.net/2025/Apr/18/gemini-image-segmentat... - but that requires additional harness engineering: you need to explicitly get it to use its image segmentation mechanism.
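
A rough sketch of the counting half of such a harness, under stated assumptions: it assumes the model was asked to return its segmentation results as a JSON list whose entries carry "box_2d", "mask" and "label" keys. The helper name count_labels and the sample reply are hypothetical, made up for illustration rather than taken from a real model response, and the API call itself is left out.

```python
import json
from collections import Counter


def count_labels(response_text: str) -> Counter:
    """Count detected objects per label in a segmentation-style JSON reply.

    Assumes the reply is a JSON list of objects, each with a "label" key.
    Entries without a label are skipped rather than guessed at.
    """
    items = json.loads(response_text)
    return Counter(
        item["label"].strip().lower()
        for item in items
        if isinstance(item, dict) and "label" in item
    )


# Hypothetical reply, invented for this example (not real model output).
# The mask values are placeholders standing in for base64-encoded PNG data.
sample_reply = json.dumps([
    {"box_2d": [120, 80, 410, 300], "mask": "<omitted>", "label": "pelican"},
    {"box_2d": [150, 320, 430, 560], "mask": "<omitted>", "label": "Pelican"},
    {"box_2d": [200, 600, 450, 820], "mask": "<omitted>", "label": "pelican "},
    {"box_2d": [500, 100, 700, 260], "mask": "<omitted>", "label": "seagull"},
])

counts = count_labels(sample_reply)
print(counts["pelican"])  # counted in code, not by asking the model for a number
```

The point of the design is that the count comes from the length of a structured list rather than from a number the model states in prose, so a miscount shows up as a missing or extra mask that can be inspected.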

1 comments

Fourth grade math's† students are learning geometry and how to draw simple plots. Vision ability (or tactile ability, for visually impaired students) is pretty important to understanding and solving those homework problems.

†: think "bo's'n"