Reasoning models with access to Python have been able to solve 4th grade math homework for over a year now. Prove me wrong: show me a 4th grade math problem they can't handle.
Fourth graders typically don't have access to Python for their homework assignments. To be fair to the kids, I tried it first without Python: Opus 4.6 (Feb 2026) with default Medium effort. https://claude.ai/share/1533a3e4-6757-4614-b95d-0743350a6598
It got questions 2 (Shop D) and 5 (280) wrong. It got question 3 right but the work it showed has the numbers for each shop wrong. My fourth grade teacher would have taken off points for that (shout out Mrs. Van Bladel).
On a second run, with a prompted nudge to use Python, it used Python to "check its work" and answered the same questions incorrectly (2 and 5). To the model's credit, it did show the correct work on answer 3 this time.
That's more of a test of vision LLM ability to correctly identify and count things in an image than it is of mathematical reasoning.
If you look at the working of your non-Python example it gets most of the counts wrong - identifying shop A as two full notebooks plus one half notebook when it's actually three full notebooks, for example. The numeric answers it then gives would be correct if it hadn't made those vision mistakes.
I've been testing vision LLMs on counting the number of pelicans in a photo for a while, and they're very unreliable at that.
The best I've seen is Google Gemini 2.5 if you have it output image segmentation masks (a feature they have not included in the Gemini 3 series yet): https://simonwillison.net/2025/Apr/18/gemini-image-segmentat... - but that requires additional harness engineering, you need to explicitly cause it to use its image segmentation mechanism.
Fourth grade math students are learning geometry and how to draw simple plots. Vision ability (or tactile ability, for visually impaired students) is pretty important to understanding and solving those homework problems.
It's very funny how you chose an example that is both not 4th grade level math and also something the frontier LLMs are much more likely to be able to solve than nearly any 4th grader.
This is a counterexample to your argument, not evidence for your claim. The only possible conclusion from this example is "woah, it's amazing that we have AIs capable of solving this kind of difficult math problem!", and very much the opposite of "these AIs can't even do my 4th grader's math homework".
GPT-5.5 found a solution only after assuming that you're allowed to concatenate numbers together e.g. 8 7 becomes 87 (it complained at first that it was "under-specified") - using Python it brute-forced a solution (actually finding 13): https://chatgpt.com/share/6a1db54f-7ab8-8333-9218-86a469c284...
I questioned OP's "there is an answer online" claim so I checked and the only source found for the original question was a 5th grade Russian school for mathematics.
Apparently there is a way to solve this without brute forcing all the combinations. It has to do with looking at how many even and odd numbers there are, and taking into account that the goal number is odd. And then thinking through the combinations [even-even=even, even-odd=odd,…]
Though this is obviously not something I would expect a 4th grader to solve.
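For anyone curious what the parity argument above looks like in practice, here is a rough Python sketch. It assumes a simplified version of the puzzle where only + and - are placed between the numbers (no concatenation, no multiplication); the digit list and the function names are made up for illustration and are not taken from the original problem. The idea is that adding or subtracting an even number never changes parity, while an odd number always flips it, so the parity of the result is fixed by how many odd numbers there are.

```python
from itertools import product


def reachable_parities(numbers):
    """Brute force: collect the parities (0 = even, 1 = odd) of every result
    obtained by putting + or - in front of each number after the first."""
    parities = set()
    for signs in product((1, -1), repeat=len(numbers) - 1):
        total = numbers[0] + sum(s * n for s, n in zip(signs, numbers[1:]))
        parities.add(total % 2)
    return parities


def predicted_parity(numbers):
    """Shortcut: the result is odd exactly when the count of odd numbers is odd."""
    return sum(n % 2 for n in numbers) % 2


# Hypothetical input, NOT the numbers from the original puzzle.
numbers = [9, 8, 7, 6, 5, 4, 3, 2, 1]
print(reachable_parities(numbers), predicted_parity(numbers))
```

If the predicted parity doesn't match the parity of the goal number, no choice of + and - can work, which rules out whole families of combinations without trying them one by one.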
pastebin of the reasoning section (no Python): https://pastebin.com/zZeG5ZnJ
Here it is again with a prompted nudge to use Python: https://claude.ai/share/e1265efb-0988-40ac-90ac-c76225b67e98
pastebin of the reasoning section (with Python): https://pastebin.com/KsP0xxZL