

	by Kuinox 350 days ago
	It's a specific model that is run for maths. GPT-5 and Gemini 2.5 still cannot compute an arbitrary-length sum of whole numbers without a calculator. I have a procedurally generated benchmark of basic operations; LLMs get better at it with time, but they still can't solve basic maths or logic problems. BTW I'm open to selling it, my email is on my HN profile.


HappMacDonald 350 days ago

Have you ever seen what these arbitrary length whole numbers look like once they are tokenized? They don't break down to one-digit-per-token, and the same long number has no guarantee of breaking down into tokens the same way every time it is encountered.
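A toy illustration of the chunking effect, assuming a made-up rule that splits digits into fixed-size groups from the left. Real tokenizers use learned vocabularies and behave differently; `chunk_left_to_right` is a hypothetical helper, not any tokenizer's API. It only shows how the same trailing digits can land in different chunks depending on what precedes them.

```python
def chunk_left_to_right(number: str, size: int = 3) -> list:
    """Toy stand-in for a tokenizer: fixed-size digit chunks, starting from the left."""
    return [number[i:i + size] for i in range(0, len(number), size)]


# The two numbers share the suffix "234567", yet it is split differently,
# so the chunk boundaries do not line up with the columns used in long addition.
with_prefix = chunk_left_to_right("1234567")
without_prefix = chunk_left_to_right("234567")
```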

But the algorithms they teach humans in school to do long-hand arithmetic (which are liable to be the only algorithms demonstrated in the training data) require a single unique numeral for every digit.
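For reference, a minimal sketch of that schoolbook algorithm, assuming non-negative whole numbers passed as decimal strings (`long_add` is an illustrative name). Every step reads exactly one digit per column, which is the requirement described above.

```python
def long_add(a: str, b: str) -> str:
    """Add two non-negative decimal strings with the schoolbook column method."""
    # Pad with leading zeros so the columns line up, one digit per column.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    carry = 0
    out = []
    # Walk from the rightmost column to the leftmost, carrying as we go.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        out.append(str(total % 10))
        carry = total // 10
    if carry:
        out.append(str(carry))
    return "".join(reversed(out))
```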

This is the same source as the problem of counting "R"'s in "Strawberry".

Kuinox 350 days ago

That was the initial thinking of everyone I explained this to, and it was also my speculation, but when you look in its reasoning at where it makes the mistake, it correctly extracts the digits from the input tokens. As I say in another comment, most of the mistakes here happen when it recopies the answer it calculated from the summation table. You can avoid the tokenization issue at the point where it extracts the answer by making it output an array of digits of the answer; it will still fail at simply recopying the correct digits.
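A small sketch of how such a digit-array answer could be checked for copy errors. This is an assumption about the setup, not the benchmark's actual code; `first_copy_error` is a hypothetical helper that reports where the copied digits first diverge from the expected sum.

```python
from typing import List, Optional


def first_copy_error(expected: str, digits: List[int]) -> Optional[int]:
    """Return the index of the first wrong digit, or None if the copy is exact."""
    copied = "".join(str(d) for d in digits)
    for i, (e, c) in enumerate(zip(expected, copied)):
        if e != c:
            return i
    # Same prefix but different lengths: a digit was dropped or added.
    if len(expected) != len(copied):
        return min(len(expected), len(copied))
    return None
```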

Fade_Dance 350 days ago

I recently saw someone that posted a leaked system prompt for GPT5 (and regardless of the truth of the matter since I can't confirm the authenticity of the claim, the point I'm making stands alone to some degree).

A portion of the system prompt was specifically instructing the LLM that math problems are, essentially, "special", and that there is zero tolerance for approximation or imprecision with these queries.

To some degree I get the issue here. Most queries are full of imprecision and generalization, and the same type of question may even get a different output if asked in a different context, but when it comes to math problems, we have absolutely zero tolerance for that. To us this is obvious, but when looking from the outside, it is a bit odd that we are so loose and sloppy with, well basically everything we do, but then we put certain characters in a math format, and we are hyper obsessed with ultra precision.

The actual system prompt section for this was funny though. It essentially said "you suck at math, you have a long history of sucking at math in all contexts, never attempt to do it yourself, always use the calculation tools you are provided."

HappMacDonald 348 days ago

o/~ Mathematics keeps your intellect intact / many answers should be carefully exact

But for daily application, use a close approximation, round it off.. o/~

Jensson 349 days ago

> But the algorithms they teach humans in school to do long-hand arithmetic (which are liable to be the only algorithms demonstrated in the training data) require a single unique numeral for every digit.

But humans don't see single digits, we learn to parse noisy visual data into single digits and then use those single digits to do the math.

It is much easier for these models to understand what the number is based on the tokens and parse that than it is for a visual model to do it based on an image, so getting those tokens streamed straight into its system makes its problem to solve much much simpler than what humans do. We weren't born able to read numbers, we learn that.

bt1a 350 days ago

i'd wager your benchmark problems require cumbersome arithmetic or are poorly worded / inadequately described. or, you're mislabeling them as basic math and logic (a domain within which LLMs have proven their strengths!)

i only call this out because you're selling it and don't hypothesize* on why they fail your simple problems. i suppose an easily aced bench wouldn't be very marketable

Kuinox 350 days ago

This is a simple sum of 2 whole numbers; the numbers are simply big.

Most of the time they make a correct summation table but fail to correctly copy the sum into the final result. That is not a tokenisation problem (you can change the output format to make sure of it). I have a separate benchmark that tests specifically this: when the input is too large, the LLMs fail to accurately copy the correct token. I suppose the positional embeddings are not perfectly learned, and that sometimes causes a mistake.

The prompt is quite short, it uses structured output, and I can generate a nice graph of the % of good responses across the difficulty of the question (which is just the total digit count of the input numbers).

LLMs have a 100% success rate on these sums until they reach a frontier; past that, their accuracy collapses at various speeds depending on the model.
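A rough sketch of how a benchmark of this shape could be put together. It is not the actual benchmark: `make_problem`, `accuracy_by_difficulty`, and both solvers are hypothetical names, and the even split of digits between the two operands is an assumption. `float_solver` is only a stand-in that is exact for short inputs and loses precision on long ones, so the accuracy-versus-digit-count curve has a visible frontier; it says nothing about how an LLM fails.

```python
import random


def make_problem(total_digits, rng):
    """Return two whole numbers (as strings) whose digit counts sum to total_digits (>= 2)."""
    len_a = total_digits // 2
    len_b = total_digits - len_a

    def rand_number(n):
        # Non-zero leading digit so the number really has n digits.
        first = str(rng.randint(1, 9))
        rest = "".join(str(rng.randint(0, 9)) for _ in range(n - 1))
        return first + rest

    return rand_number(len_a), rand_number(len_b)


def accuracy_by_difficulty(solver, difficulties, trials, seed=0):
    """Map each difficulty (total digit count) to the fraction of correct answers."""
    rng = random.Random(seed)
    results = {}
    for d in difficulties:
        correct = 0
        for _ in range(trials):
            a, b = make_problem(d, rng)
            if solver(a, b) == str(int(a) + int(b)):
                correct += 1
        results[d] = correct / trials
    return results


def exact_solver(a, b):
    # Reference solver using Python's arbitrary-precision integers.
    return str(int(a) + int(b))


def float_solver(a, b):
    # Stand-in "weak" solver: exact only while the values fit in a double.
    return str(int(float(a) + float(b)))
```

In a real run the solver would wrap a model call and parse its structured output; plotting `results` against difficulty gives the kind of graph described above.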

bwfan123 349 days ago

This is close to what the Apple paper [1] also found on constraint satisfaction problems. As an example, on Towers of Hanoi, past a frontier, accuracy collapses.

Even when the algorithm steps are laid out precisely, they cannot be followed. Perhaps LLMs should be trained on Turing machine specs and be given a tape lol.

Constraint satisfaction and combinatorics are where the search space is exponential, and the techniques are not formalized (not enough data in the training set), and they remain hard for machines, as seen in Problem 6 of the IMO, which could not be solved by LLMs. I suspect there is an aspect of human intelligence which is not yet captured in LLMs.

[1] - https://machinelearning.apple.com/research/illusion-of-think...

energy123 350 days ago

Have you tried greedy decoding (temp 0) in aistudio?

The temp 0.7-1.0 defaults are not designed for reconstructing context with perfect accuracy.

Kuinox 349 days ago

I always use the lowest temperature that I can input. But GPT-5 doesn't support a temperature setting. You'll get something like:

{
  "error": {
    "message": "Unsupported value: 'temperature' does not support 0.0 with this model. Only the default (1) value is supported.",
    "type": "invalid_request_error",
    "param": "temperature",
    "code": "unsupported_value"
  }
}

KoolKat23 350 days ago

I can't see why that's necessary, when it can call a tool. Everyone uses a calculator. A logic problem, it can solve with reasoning, perhaps it's not the smartest but it can solve logic problems. All indications are that it will continue to become smarter.

Kuinox 349 days ago

Simple maths problems are simple logic problems. Here it doesn't even have to come up with the reasoning; it has probably already memorised how to solve sums. Yet it fails at that, which shows it cannot solve logic problems if there are too many steps.

> All indications are that it will continue to become smarter.

I'm not disputing that; every new model scores better on my benchmark, but right now none truly "solves" one of these small logic problems.

KoolKat23 349 days ago

If it can frame the question for the tool, it therefore has the logic (whether that was static recall or deductive).

LLMs struggle with simple maths by nature of their architecture, not due to a lack of logic. Yes, they struggle with logic questions too, but the two are not directly related here.

Kuinox 349 days ago

Most of the failures on these simple logic questions come from the inability to simply copy data accurately. Logic is too abstract to be measured, but this single bench shows something getting in its way. I have another bench that shows that LLMs make basic mistakes that can easily be avoided with minimal logic and observation.

Jensson 349 days ago

> LLMs struggle with simple maths by nature of their architecture, not due to a lack of logic.

No, if it were good at logic it would have overcome that tiny architectural hurdle; it's such a trivial process to convert tokens to numbers that it is ridiculous for you to suggest that is the reason it fails at math.

The reason it fails at math is that it fails at logic, and math is the most direct set of logic we have. It doesn't fail at converting between formats: it can convert strawberry to the correct Base64 encoding, meaning it does know exactly what letters are there; it just lacks the logic to actually understand what "count letters" means.
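For comparison, here is what the two tasks look like as ordinary deterministic operations on the character sequence, using Python's standard `base64` module. This only illustrates that both depend on knowing the exact letters; it makes no claim about what happens inside a model.

```python
import base64

word = "strawberry"

# Encoding requires every character of the word, in order.
encoded = base64.b64encode(word.encode("ascii")).decode("ascii")
decoded = base64.b64decode(encoded).decode("ascii")

# Counting a letter uses exactly the same information.
r_count = word.count("r")
```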

KoolKat23 349 days ago

It can't see that data so how can it convert it? It can only see the token input.

An analogy (probably poor) is like asking a human to see UV light. We can do so, but only with tools or by removing our lens.

The fact that SOTA models (not yet publicly available) can achieve gold at the IMO implies otherwise.

Kuinox 349 days ago

It's because math problems make it easy to check that a solution is correct, which allows doing a lot of 'search': https://yellow-apartment-148.notion.site/AI-Search-The-Bitte...

gjm11 350 days ago

> GPT-5 and Gemini 2.5 still cannot compute an arbitrary-length sum of whole numbers without a calculator.

Neither can many humans, including some very smart ones. Even those who can will usually choose to use a calculator (or spreadsheet or whatever) rather than doing the arithmetic themselves.

simoncion 350 days ago

> Neither can many humans...

1) GPT-5 is advertised as "PhD-level intelligence". So, I take OpenAI (and anyone else who advertises their bots with language like this) at their word about the bot's capabilities and constrain the set of humans I use for comparison to those who also have PhD-level intelligence.

2) Any human who has been introduced to long addition will absolutely be able to compute the sum of two whole numbers of arbitrary length. You may have to provide them a sufficiently strong incentive to actually do it long-hand, but they absolutely are capable because the method is not difficult. I'm fairly certain that most adult humans [0] (regardless of whether or not they have PhD-level intelligence) find the method to be trivial, if tedious.

[0] And many human children!

gjm11 349 days ago

I have a PhD, in mathematics, from a top university. If you give me, say, 100 10-digit numbers to add up and tell me to do the job in my head then I will probably get the answer wrong.

Of course, if you give me 100 10-digit numbers to add up and let me use a calculator, or pencil and paper, then I will probably get it right.

Same for, say, two 100-digit numbers. (I can probably get that one right without tools if you obligingly print them monospaced and put one of them immediately above the other where I can look at them.)

Anyway, the premise here seems to be simply false. I just gave ChatGPT and Claude (free versions of both; ChatGPT5, whatever specific model it routed my query to, and Sonnet 4) a list of 100 random 10-digit numbers to add up, with a prompt encouraging them to be careful about it but nothing beyond that (e.g., no specific strategies or tools to use), and both of them got the right total. Then I did the same with two 100-digit numbers and both of them got that right too.
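A minimal sketch of how a test like this can be reproduced and checked, assuming the numbers are drawn uniformly at random. This is not necessarily how the list above was generated; `random_n_digit` and `check_answer` are hypothetical helpers, and the seed is arbitrary. Python integers are arbitrary precision, so the reference total is exact.

```python
import random


def random_n_digit(n, rng):
    """Return a random whole number with exactly n decimal digits."""
    return rng.randint(10 ** (n - 1), 10 ** n - 1)


rng = random.Random(42)  # fixed seed so the list can be regenerated
numbers = [random_n_digit(10, rng) for _ in range(100)]
expected_total = sum(numbers)


def check_answer(model_answer: str) -> bool:
    """Compare a model's reply with the exact total, ignoring commas and spaces."""
    cleaned = model_answer.replace(",", "").replace(" ", "").strip()
    return cleaned.isdigit() and int(cleaned) == expected_total
```

Pasting `numbers` into a chat and passing the reply to `check_answer` gives a single pass/fail sample; repeated runs would be needed to estimate a failure rate.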

Kuinox 349 days ago

https://i.imgur.com/l2elIAv.png

Difficulty is the number of digits. Small models struggle with 10-digit numbers; Gemini and GPT-5 are very good recent models. Gemini starts failing before 40 digits; GPT-5 (the one via the API; the online chat version is worse and I haven't tested it) can do more than 120 digits (at this point it's pointless to test for more).

gjm11 349 days ago

My tests of GPT-5 were using the online chat version.

Of course, I only ran it once; I can't at all rule out the possibility that sometimes it gets it wrong. But, again, the same is true of humans.

Kuinox 349 days ago

The online version is way worse; it also has a router that could route it to a random model.

mathiaspoint 350 days ago

Right but most (competent) humans will reliably use a calculator. It's difficult to get these to reliably make lots of tool calls like that.

Kuinox 350 days ago

I do think that competent humans can solve any arbitrary sum of 2 whole numbers with a pen, paper and time. LLMs can't do that.

rileymat2 350 days ago

That’s interesting, you added a tool. You did not just leave it to the human alone.

simoncion 350 days ago

I'm not the fellow you replied to, but I felt like stepping in.

> That’s interesting, you added a tool.

The "tool" in this case, is a memory aid. Because they are computer programs running inside a fairly-ordinary computer, the LLMs have exactly the same sort of tool available to them. I would find a claim that LLMs don't have a free MB or so of RAM to use as scratch space for long addition to be unbelievable.

gjm11 349 days ago

The fact that an LLM is running inside an ordinary computer does not mean that it gets to use all the abilities of that computer. They do not have megabytes of scratch space merely because the computer has a lot of memory.

They do have something a bit like it: their "context window", the amount of input and recently-generated output they get to look at while generating the next token. Claude Sonnet 4 has 1M tokens of context, but e.g. Opus 4.1 has only 200k and I think GPT-5 has 256k. And it doesn't really behave like "scratch space" in any useful sense; e.g., the models can't modify anything once it's there.

Kuinox 349 days ago

LLMs already get enough working memory, they do not fail because of lack of working space.