by sanderjd 17 days ago
I would genuinely be interested in knowing what you're doing that led you to this conclusion.

I would be shocked if I was unable to solve 4th grade math homework with any of the contemporary frontier models. I spend most days using them to do significantly more complex things than that.


If they took a blurry photo of the piece of paper and uploaded it to ChatGPT saying "solve this" then I would totally believe it. The frontier models are mostly obnoxiously bad at OCR and at properly ingesting what's on an image of a page.

If you write out the 4th grade math problem, they would have no trouble.

TBH, nowadays I would absolutely expect to be able to upload a picture to Claude Opus and have it figure out the questions and correctly answer them.
I sent this worksheet to Opus 4.8 High:

https://www.k5learning.com/sites/all/files/worksheets/math/g...

Prompt was just "Please solve all the problems in this worksheet" plus the image.

I got this:

  1. 86 → 1, 2, 43, 86
  2. 7 → 1, 7 (prime)
  3. 12 → 1, 2, 3, 4, 6, 12
  4. 38 → 1, 2, 19, 38
  5. 52 → 1, 2, 4, 13, 26, 52
  6. 9 → 1, 3, 9
  7. 73 → 1, 73 (prime)
  8. 98 → 1, 2, 7, 14, 49, 98
  9. 24 → 1, 2, 3, 4, 6, 8, 12, 24
  10. 6 → 1, 2, 3, 6
  11. 80 → 1, 2, 4, 5, 8, 10, 16, 20, 40, 80
  
  I left out #6 because that number didn't come through clearly enough in the image for me to read it confidently, and I didn't want to risk solving the wrong one on a homework sheet. If you let me know what it is, I'll factor it right away.
It failed to read the "77", it incorrectly reported the line item it failed to read as #6 rather than #4, and it numbered the output incorrectly: it should have left a gap in the list for the one it failed on rather than having the second half of the answers be off by one. It did actually factor everything correctly, though.
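For anyone who wants to check the "factored everything correctly" part rather than eyeball it, here is a minimal Python sketch. The `factors` helper and the `reported` dict are my own names, not anything from the worksheet or the model; the lists are copied from the reply quoted above, with its numbering ignored.

```python
def factors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order (trial division)."""
    small, large = [], []
    d = 1
    while d * d <= n:
        if n % d == 0:
            small.append(d)
            if d != n // d:          # avoid listing a square root twice
                large.append(n // d)
        d += 1
    return small + large[::-1]

# Factor lists as quoted in the model's reply (assumed transcribed faithfully).
reported = {
    86: [1, 2, 43, 86],
    7: [1, 7],
    12: [1, 2, 3, 4, 6, 12],
    38: [1, 2, 19, 38],
    52: [1, 2, 4, 13, 26, 52],
    9: [1, 3, 9],
    73: [1, 73],
    98: [1, 2, 7, 14, 49, 98],
    24: [1, 2, 3, 4, 6, 8, 12, 24],
    6: [1, 2, 3, 6],
    80: [1, 2, 4, 5, 8, 10, 16, 20, 40, 80],
}

# Any number whose reported list disagrees with the computed one ends up here.
mismatches = {n: factors(n) for n, got in reported.items() if factors(n) != got}
print(mismatches)    # {}
print(factors(77))   # [1, 7, 11, 77]
```

The empty dict means every list it did return matches, and the last line gives the answer for the "77" it skipped.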
Yep, fair enough. So pretty far from perfect still! But quite good. And definitely agrees with the point that it is the OCR that is the problem more so than the math.
No, LLMs just can't do math.
If your math does not involve multiplying 20-digit numbers, modern LLMs can "do" math even without a Python tool, counterintuitive as that may be for next-token prediction.
And if you give your LLM access to a calculator, it will have no problem multiplying 20-digit numbers.
They can definitely recognize the problem class and build programs to do math. So what's the difference?

It's like saying that people can't turn high torque nuts on machine bolts, because you can't use your fingers to do it. But you can use a wrench, so effectively, we can turn high torque nuts on machine bolts even though it isn't something we can natively do unaided.

The neat thing about that claim is that it's easily falsifiable.

I asked Opus 4.8 "What is 12 times 13" and it gave me "156".

So it would appear that your statement is no longer true.

Again, I'm very interested in your methodology here. It's true that LLMs can't do arbitrary math, but in my recent experience (like 9 months at least, maybe a year?), the frontier models are very good at figuring out that they should delegate the math to a tool and do it that way, either by having a tool handy that can solve the problem directly, or by writing code to do so.
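To make "delegate the math to a tool" concrete, here is a rough sketch of what such a tool could look like on the Python side. The function name `multiply_exact` and the two 20-digit operands are hypothetical, picked for illustration; this is not how any particular model or vendor wires up its tools. The only thing it relies on is that Python integers are arbitrary precision, so the product is exact.

```python
def multiply_exact(a: str, b: str) -> str:
    """Hypothetical calculator tool: exact product of two decimal integer strings."""
    return str(int(a) * int(b))

# The small case from the thread.
print(multiply_exact("12", "13"))  # 156

# Two made-up 20-digit operands, the size people say models get wrong unaided.
x = "12345678901234567890"
y = "98765432109876543210"
product = multiply_exact(x, y)

# Sanity check without trusting a hand-typed answer: dividing back must be exact.
quotient, remainder = divmod(int(product), int(x))
print(quotient == int(y) and remainder == 0)  # True
```

The model only has to recognize the problem and pass the operands through as strings; the arithmetic itself never goes through next-token prediction.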