| HN Mirror

by otabdeveloper4 60 days ago

> the spicy autocomplete can solve difficult open math problems

No it can't. It can't even solve my son's 4th grade math homework. (This is a real use case for me, not a dumb benchmark.)

You just know nothing about math and are happy to parrot bullshit AI salesmen are selling you.

5 comments

ConceptJunkie 60 days ago

Terence Tao disagrees with what you're saying. I think he's in a slightly better position to speak on the subject.

otabdeveloper4 57 days ago

Terence Tao is an expert on 4th grade homework? TIL!

skinner_ 60 days ago

> You just know nothing about math and are happy to parrot bullshit AI salesmen are selling you.

Not the parent poster here. I do know things about math. I wrote a few papers related to the unit distance problem (https://arxiv.org/abs/2311.10069, https://arxiv.org/abs/2406.15317) and spent quite some time trying to solve it. I had no chance of coming up with the proof that the spicy autocomplete came up with. Dumb benchmark, sure.

otabdeveloper4 60 days ago

LLMs are good with symbolic manipulation but can't reason.

You can skirt around not reasoning in research math because so much of it is just extremely tedious symbolic manipulation.

You can't cheat with advanced fourth grade math, though. They don't know algebra yet and can't substitute verbosity for reasoning.

skinner_ 60 days ago

> You can skirt around not reasoning in research math because so much of it is just extremely tedious symbolic manipulation.

LOL

threatofrain 60 days ago

We're already long past that threshold.

sanderjd 60 days ago

I would genuinely be interested in knowing what you're doing that led you to this conclusion.

I would be shocked if I was unable to solve 4th grade math homework with any of the contemporary frontier models. I spend most days using them to do significantly more complex things than that.

margalabargala 60 days ago

If they took a blurry photo of the piece of paper and uploaded it to ChatGPT saying "solve this" then I would totally believe it. The frontier models are mostly obnoxiously bad at OCR and properly ingesting what's on an image of a page.

If you write out the 4th grade math problem, they would have no trouble.

sanderjd 59 days ago

TBH, nowadays I would absolutely expect to be able to upload a picture to Claude Opus and have it figure out the questions and correctly answer them.

margalabargala 59 days ago

I sent this worksheet to Opus 4.8 High:

https://www.k5learning.com/sites/all/files/worksheets/math/g...

Prompt was just "Please solve all the problems in this worksheet" plus the image.

I got this:

  1. 86 → 1, 2, 43, 86
  2. 7 → 1, 7 (prime)
  3. 12 → 1, 2, 3, 4, 6, 12
  4. 38 → 1, 2, 19, 38
  5. 52 → 1, 2, 4, 13, 26, 52
  6. 9 → 1, 3, 9
  7. 73 → 1, 73 (prime)
  8. 98 → 1, 2, 7, 14, 49, 98
  9. 24 → 1, 2, 3, 4, 6, 8, 12, 24
  10. 6 → 1, 2, 3, 6
  11. 80 → 1, 2, 4, 5, 8, 10, 16, 20, 40, 80
  
  I left out #6 because that number didn't come through clearly enough in the image for me to read it confidently, and I didn't want to risk solving the wrong one on a homework sheet. If you let me know what it is, I'll factor it right away.

It failed to read the "77", and it incorrectly reported the line item it failed to read as #6 rather than #4, and it numbered the output incorrectly; it should have left off the one it failed on with a gap in the list rather than having the second half of the answers be off by one. It did actually factor everything correctly though.
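The factor lists quoted above can be checked mechanically. A minimal sketch in Python, assuming the numbers are exactly as transcribed in this comment (the worksheet image itself is not reproduced here); `factors` and `quoted` are illustrative names, not anything the model or the worksheet used:

```python
def factors(n: int) -> list[int]:
    """Return every positive divisor of n in ascending order."""
    small, large = [], []
    d = 1
    while d * d <= n:
        if n % d == 0:
            small.append(d)
            if d != n // d:
                large.append(n // d)
        d += 1
    return small + large[::-1]


# Answers as quoted in the comment above (assumed to be transcribed faithfully).
quoted = {
    86: [1, 2, 43, 86],
    7: [1, 7],
    12: [1, 2, 3, 4, 6, 12],
    38: [1, 2, 19, 38],
    52: [1, 2, 4, 13, 26, 52],
    9: [1, 3, 9],
    73: [1, 73],
    98: [1, 2, 7, 14, 49, 98],
    24: [1, 2, 3, 4, 6, 8, 12, 24],
    6: [1, 2, 3, 6],
    80: [1, 2, 4, 5, 8, 10, 16, 20, 40, 80],
}

# Collect any number whose quoted list differs from the computed one.
mismatches = {n: factors(n) for n, listed in quoted.items() if factors(n) != listed}
print(mismatches)   # {}
print(factors(77))  # [1, 7, 11, 77], the item the model could not read
```

An empty `mismatches` dict agrees with the observation that the arithmetic was right and only the reading and numbering went wrong.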

sanderjd 59 days ago

Yep, fair enough. So pretty far from perfect still! But quite good. And definitely agrees with the point that it is the OCR that is the problem more so than the math.

otabdeveloper4 60 days ago

No, LLMs just can't do math.

minimaxir 60 days ago

If your math does not involve multiplying 20 digit numbers, modern LLMs can "do" math even without a Python tool despite the counterintuition of next token prediction.

DiogenesKynikos 59 days ago

And if you give your LLM access to a calculator, it will have no problem multiplying 20-digit numbers.
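As a rough illustration of the calculator point: Python integers are arbitrary precision, so a tool call that runs something like the following multiplies 20-digit numbers exactly. The `multiply` helper is a hypothetical stand-in for whatever calculator tool a model is given, not a real tool API:

```python
def multiply(a: int, b: int) -> int:
    """Hypothetical calculator tool: exact integer multiplication."""
    return a * b


x = 10**19 + 1  # a 20-digit number: 10000000000000000001
square = multiply(x, x)

# (10^19 + 1)^2 = 10^38 + 2*10^19 + 1, compared using exact integer arithmetic.
print(len(str(x)))                          # 20
print(square == 10**38 + 2 * 10**19 + 1)    # True
print(multiply(12, 13))                     # 156
```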

bdamm 60 days ago

They can definitely recognize the problem class and build programs to do math. So what's the difference?

It's like saying that people can't turn high torque nuts on machine bolts, because you can't use your fingers to do it. But you can use a wrench, so effectively, we can turn high torque nuts on machine bolts even though it isn't something we can natively do unaided.

margalabargala 60 days ago

The neat thing about that claim is that it's easily falsifiable.

I asked Opus 4.8 "What is 12 times 13" and it gave me "156".

So it would appear that your statement is no longer true.

sanderjd 59 days ago

Again, I'm very interested in your methodology here. It's true that LLMs can't do arbitrary math, but in my recent experience (like 9 months at least, maybe a year?), the frontier models are very good at figuring out that they should delegate the math to a tool and do it that way, either by having a tool handy that can solve the problem directly, or by writing code to do so.

simonw 60 days ago

Reasoning models with access to Python have been able to solve 4th grade math homework for over a year now. Prove me wrong: show me a 4th grade math problem they can't handle.

tomjakubowski 58 days ago

The images you can't see in the chats are the question sheet from here, which was the first fourth grade math homework assignment I tried. https://www.k5learning.com/worksheets/math/data-graphing/gra...

Fourth graders typically don't have access to Python for their homework assignments. To be fair to the kids, I tried it first without Python: Opus 4.6 (Feb 2026) with default Medium effort. https://claude.ai/share/1533a3e4-6757-4614-b95d-0743350a6598

pastebin of the reasoning section (no Python): https://pastebin.com/zZeG5ZnJ

It got questions 2 (Shop D) and 5 (280) wrong. It got question 3 right but the work it showed has the numbers for each shop wrong. My fourth grade teacher would have taken off points for that (shout out Mrs. Van Bladel).

Here it is again with a prompted nudge to use Python: https://claude.ai/share/e1265efb-0988-40ac-90ac-c76225b67e98

pastebin of the reasoning section (with Python): https://pastebin.com/KsP0xxZL

This time it used Python to "check its work", and answered the same questions incorrectly (2 and 5). To the model's credit, it did show the correct work on answer 3 this time.

simonw 58 days ago

That's more of a test of vision LLM ability to correctly identify and count things in an image than it is of mathematical reasoning.

If you look at the working of your non-Python example it gets most of the counts wrong - identifying shop A as two full notebooks plus one half notebook when it's actually three full notebooks, for example. The numeric answers it then gives would be correct if it hadn't made those vision mistakes.

I've been testing vision LLMs on counting the number of pelicans in a photo for a while; they're very unreliable at that.

The best I've seen is Google Gemini 2.5 if you have it output image segmentation masks (a feature they have not included in the Gemini 3 series yet): https://simonwillison.net/2025/Apr/18/gemini-image-segmentat... - but that requires additional harness engineering, you need to explicitly cause it to use its image segmentation mechanism.

tomjakubowski 58 days ago

Fourth grade math's† students are learning geometry and how to draw simple plots. Vision ability (or tactile ability, for visually impaired students) is pretty important to understanding and solving those homework problems.

†: think "bo's'n"

otabdeveloper4 60 days ago

> show me a 4th grade math problem they can't handle

Sure.

"8 7 6 5 4 3 2 1 - add minus signs and parenthesis to get 31."

P.S. There is an answer online and some LLMs will just copy it verbatim. This doesn't count.

sanderjd 59 days ago

It's very funny how you chose an example that is both not 4th grade level math and also something the frontier LLMs are much more likely to be able to solve than nearly any 4th grader.

This is a counterexample to your argument, not evidence for your claim. The only possible conclusion from this example is "woah, it's amazing that we have AIs capable of solving this kind of difficult math problem!", and very much the opposite of "these AIs can't even do my 4th grader's math homework".

simonw 60 days ago

Whoa, 4th grade math problems got hard! I'm not sure how I'd tackle that one myself.

simonw 60 days ago

GPT-5.5 found a solution only after assuming that you're allowed to concatenate numbers together e.g. 8 7 becomes 87 (it complained at first that it was "under-specified") - using Python it brute-forced a solution (actually finding 13): https://chatgpt.com/share/6a1db54f-7ab8-8333-9218-86a469c284...

Are you sure this is 4th grade level?
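For reference, the brute force described here is small enough to sketch. This is not the code from the linked chat; it is one possible version, assuming the same reading of the puzzle (digits stay in order, adjacent digits may be joined into multi-digit numbers, and only subtraction and parentheses are allowed). The names `reachable` and `groupings` are made up for this sketch:

```python
from functools import lru_cache
from itertools import product

DIGITS = "87654321"
TARGET = 31


@lru_cache(maxsize=None)
def reachable(terms: tuple) -> frozenset:
    """Every value of terms[0] - terms[1] - ... over all parenthesizations."""
    if len(terms) == 1:
        return frozenset(terms)
    values = set()
    # The last subtraction performed splits the sequence into two sub-expressions.
    for k in range(1, len(terms)):
        for left in reachable(terms[:k]):
            for right in reachable(terms[k:]):
                values.add(left - right)
    return frozenset(values)


def groupings(digits: str):
    """Yield every way to join adjacent digits into consecutive numbers."""
    for cuts in product([False, True], repeat=len(digits) - 1):
        terms, current = [], digits[0]
        for ch, cut in zip(digits[1:], cuts):
            if cut:
                terms.append(int(current))
                current = ch
            else:
                current += ch
        terms.append(int(current))
        yield tuple(terms)


# Without joining digits, 31 is not reachable.
no_concat = tuple(int(d) for d in DIGITS)
print(TARGET in reachable(no_concat))  # False

# With joining allowed, collect the groupings that can reach 31.
hits = [g for g in groupings(DIGITS) if TARGET in reachable(g)]
print((87, 6, 54, 3, 2, 1) in hits)    # True: 87 - 6 - (54 - 3 - 2) - 1 = 31
```

The sketch counts groupings rather than distinct parenthesized expressions, so its totals are not directly comparable to a solution count reported elsewhere.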

minimaxir 60 days ago

I questioned OP's "there is an answer online" claim so I checked and the only source found for the original question was a 5th grade Russian school for mathematics.

https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

MichaelNolan 60 days ago

Apparently there is a way to solve this without brute forcing all the combinations. It has to do with looking at how many even and odd numbers there are, and taking into account the goal number is odd. And then thinking through the combinations [even-even=even, even-odd=odd,…]

Though this is obviously not something I would expect a 4th grader to solve.
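One way to read the parity idea, as a sketch: it assumes the digits are not joined together, and that minus signs plus parentheses amount to giving each digit a sign of plus or minus. Under those assumptions the result is a signed sum, and flipping a sign never changes whether the total is even or odd:

```latex
S = \sum_{k=1}^{8} \varepsilon_k \, k, \qquad \varepsilon_k \in \{+1, -1\}

\varepsilon_k \, k \equiv k \pmod{2}
\;\Longrightarrow\;
S \equiv \sum_{k=1}^{8} k = 36 \equiv 0 \pmod{2}
```

So every value reachable this way is even, while 31 is odd. On this reading the argument rules out the plain version of the puzzle rather than solving it, which fits the observation elsewhere in the thread that digits have to be joined (for example "8 7" into "87").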

DiogenesKynikos 59 days ago

> 4th grade math problem

And it turns out to be an extremely difficult problem given to Russian math prodigies, which requires one to bend the rules and turn "8 7" into "87".

otabdeveloper4 57 days ago

It's a standard "Russian math" problem. There's boatloads more where that came from, and none of them are solved by LLMs.
