by gjm11 303 days ago
> GPT-5 and Gemini 2.5 still cannot compute an arbitrary-length sum of whole numbers without a calculator.

Neither can many humans, including some very smart ones. Even those who can will usually choose to use a calculator (or spreadsheet or whatever) rather than doing the arithmetic themselves.

> Neither can many humans...

1) GPT-5 is advertised as "PhD-level intelligence". So, I take OpenAI (and anyone else who advertises their bots with language like this) at their word about the bot's capabilities and constrain the set of humans I use for comparison to those who also have PhD-level intelligence.

2) Any human who has been introduced to long addition will absolutely be able to compute the sum of two whole numbers of arbitrary length. You may have to provide them a sufficiently strong incentive to actually do it long-hand, but they absolutely are capable because the method is not difficult. I'm fairly certain that most adult humans [0] (regardless of whether or not they have PhD-level intelligence) find the method to be trivial, if tedious.

[0] And many human children!
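For reference, the long-addition method mentioned above can be sketched in a few lines. This is only an illustration of the schoolbook procedure (rightmost digit first, carrying as you go); the function name `long_add` is made up for this example, and it assumes non-negative whole numbers written as decimal digit strings.

```python
def long_add(a: str, b: str) -> str:
    """Add two non-negative whole numbers given as decimal digit strings.

    Hypothetical helper: schoolbook long addition, one column at a time.
    """
    i, j = len(a) - 1, len(b) - 1  # start at the rightmost digit of each number
    carry = 0
    digits = []
    while i >= 0 or j >= 0 or carry:
        da = int(a[i]) if i >= 0 else 0  # treat missing columns as 0
        db = int(b[j]) if j >= 0 else 0
        carry, digit = divmod(da + db + carry, 10)
        digits.append(str(digit))
        i -= 1
        j -= 1
    # Digits were collected right to left; reverse and drop leading zeros.
    result = "".join(reversed(digits)).lstrip("0")
    return result or "0"
```

The only state carried from one column to the next is a single carry digit, which is why the method works for numbers of any length given enough paper and patience.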

I have a PhD, in mathematics, from a top university. If you give me, say, 100 10-digit numbers to add up and tell me to do the job in my head then I will probably get the answer wrong.

Of course, if you give me 100 10-digit numbers to add up and let me use a calculator, or pencil and paper, then I will probably get it right.

Same for, say, two 100-digit numbers. (I can probably get that one right without tools if you obligingly print them monospaced and put one of them immediately above the other where I can look at them.)

Anyway, the premise here seems to be simply false. I just gave ChatGPT and Claude (free versions of both; ChatGPT5, whatever specific model it routed my query to, and Sonnet 4) a list of 100 random 10-digit numbers to add up, with a prompt encouraging them to be careful about it but nothing beyond that (e.g., no specific strategies or tools to use), and both of them got the right total. Then I did the same with two 100-digit numbers and both of them got that right too.

https://i.imgur.com/l2elIAv.png
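If anyone wants to repeat this kind of test, here is a rough sketch of how the inputs and the reference answers could be produced. It is not the exact procedure used above: the seed, the helper names (`random_n_digit`, `is_correct`) and the answer-cleaning rules are all assumptions made for illustration. Python integers are arbitrary precision, so the reference sums are exact.

```python
import random


def random_n_digit(n: int, rng: random.Random) -> int:
    """Return a random whole number with exactly n decimal digits."""
    return rng.randrange(10 ** (n - 1), 10 ** n)


rng = random.Random(0)  # fixed seed, chosen arbitrarily so runs are repeatable

# Test 1: 100 random 10-digit numbers to add up.
ten_digit = [random_n_digit(10, rng) for _ in range(100)]
expected_total = sum(ten_digit)

# Test 2: two random 100-digit numbers.
pair = [random_n_digit(100, rng) for _ in range(2)]
expected_pair_sum = pair[0] + pair[1]


def is_correct(model_answer: str, expected: int) -> bool:
    """Compare a model's reply to the exact sum, ignoring commas and spaces."""
    cleaned = model_answer.replace(",", "").replace(" ", "").strip()
    return cleaned.isdigit() and int(cleaned) == expected
```

Paste `ten_digit` or `pair` into the prompt and run the model's final number through `is_correct`. Repeating this over several seeds would say more about reliability than a single run does.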

Difficulty depends on the number of digits. Small models struggle with 10-digit numbers. Gemini and GPT-5 are very good recent models: Gemini starts failing before 40 digits, while GPT-5 (via the API; the online chat version is worse and I didn't test it) can do more than 120 digits (at that point it's pointless to test further).

My tests of GPT-5 were using the online chat version.

Of course, I only ran it once; I can't at all rule out the possibility that sometimes it gets it wrong. But, again, the same is true of humans.

The online version is way worse. It also has a router that could send your query to a random model.

Right, but most (competent) humans will reliably use a calculator. It's difficult to get these models to reliably make lots of tool calls like that.

I do think that competent humans can compute the sum of any two whole numbers with a pen, paper and time. LLMs can't do that.

That’s interesting, you added a tool. You did not just leave it to the human alone.

I'm not the fellow you replied to, but I felt like stepping in.

> That’s interesting, you added a tool.

The "tool" in this case is a memory aid. Because they are computer programs running inside a fairly ordinary computer, the LLMs have exactly the same sort of tool available to them. I would find a claim that LLMs don't have a free MB or so of RAM to use as scratch space for long addition to be unbelievable.

The fact that an LLM is running inside an ordinary computer does not mean that it gets to use all the abilities of that computer. They do not have megabytes of scratch space merely because the computer has a lot of memory.

They do have something a bit like it: their "context window", the amount of input and recently-generated output they get to look at while generating the next token. Claude Sonnet 4 has 1M tokens of context, but e.g. Opus 4.1 has only 200k and I think GPT-5 has 256k. And it doesn't really behave like "scratch space" in any useful sense; e.g., the models can't modify anything once it's there.

Well, the GPT-5 context window offers a little more than a MB.
LLMs already get enough working memory; they do not fail for lack of working space.
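
For what it's worth, the "a little more than a MB" figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes roughly 4 bytes of text per token, which is a common rule of thumb for English prose and not a measured value (long digit strings may tokenize quite differently). The token counts are the ones quoted earlier in this thread, not verified figures.

```python
BYTES_PER_TOKEN = 4  # assumption: rough rule of thumb for English text

# Context sizes as quoted in the thread (unverified).
context_tokens = {
    "Claude Sonnet 4": 1_000_000,
    "Opus 4.1": 200_000,
    "GPT-5": 256_000,
}


def approx_megabytes(tokens: int) -> float:
    """Convert a token count to an approximate size in MB (10**6 bytes)."""
    return tokens * BYTES_PER_TOKEN / 1_000_000


for model, tokens in context_tokens.items():
    print(f"{model}: ~{approx_megabytes(tokens):.2f} MB")
```

Under that assumption 256k tokens comes to about 1 MB of text, though as noted above the context is append-only rather than freely rewritable scratch space.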