| HN Mirror



	by bt1a 307 days ago
	I'd wager your benchmark problems require cumbersome arithmetic or are poorly worded or inadequately described. Or you're mislabeling them as basic math and logic (a domain in which LLMs have proven their strengths!). I only call this out because you're selling it and don't hypothesize about why they fail your simple problems. I suppose an easily aced bench wouldn't be very marketable.


Kuinox 306 days ago

This is a simple sum of two whole numbers; the numbers are simply big.

Most of the time they build a correct summation table but fail to copy the sum correctly into the final result. That is not a tokenisation problem (you can change the output format to confirm this). I have a separate benchmark that tests specifically this: when the input is too large, the LLMs fail to accurately copy the correct tokens. I suppose the positional embeddings are not perfectly learned, and that sometimes causes a mistake.
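A copy test of this kind could be sketched roughly as follows. This is not the actual benchmark code, which is not shown here; the names `make_copy_case` and `first_mismatch` are hypothetical, and the model call itself is left out.

```python
import random


def make_copy_case(n_digits, seed=0):
    """Build a random digit string the model is asked to repeat verbatim."""
    rng = random.Random(seed)
    return "".join(rng.choice("0123456789") for _ in range(n_digits))


def first_mismatch(expected, actual):
    """Return the index of the first differing character, or None if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One string is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None
```

Recording where the first mismatch falls, rather than only pass/fail, would show whether errors cluster at particular positions in long inputs.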

The prompt is quite short, it uses structured output, and I can generate a nice graph of the % of correct responses across question difficulty (which is just the total digit count of the input numbers).

LLMs have a 100% success rate on these sums until they reach a frontier; past that, their accuracy collapses at various speeds depending on the model.
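For anyone who wants to reproduce the shape of that graph, here is a minimal sketch, assuming difficulty is the total digit count of the two operands as described above. The helper names are hypothetical and the model call is omitted; Python's arbitrary-precision integers supply the ground truth.

```python
import random
from collections import defaultdict


def make_sum_case(total_digits, rng):
    """Return (a, b, a + b) where len(str(a)) + len(str(b)) == total_digits."""
    if total_digits < 2:
        raise ValueError("need at least one digit per operand")
    a_digits = total_digits // 2
    b_digits = total_digits - a_digits
    # randrange excludes the upper bound, so each operand has exactly
    # the requested number of digits and no leading zero.
    a = rng.randrange(10 ** (a_digits - 1), 10 ** a_digits)
    b = rng.randrange(10 ** (b_digits - 1), 10 ** b_digits)
    return a, b, a + b


def accuracy_by_difficulty(results):
    """results: iterable of (difficulty, is_correct) -> {difficulty: fraction correct}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for difficulty, ok in results:
        totals[difficulty] += 1
        correct[difficulty] += int(ok)
    return {d: correct[d] / totals[d] for d in totals}
```

Plotting the returned dictionary with difficulty on the x-axis would make the frontier visible as the point where the fraction drops below 1.0.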


bwfan123 306 days ago

This is close to what the Apple paper [1] also found on constraint satisfaction problems. As an example, on Towers of Hanoi, accuracy collapses past a frontier.

Even when the algorithm steps are laid out precisely, they cannot be followed. Perhaps LLMs should be trained on Turing machine specs and be given a tape lol.
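For reference, the textbook recursive Towers of Hanoi procedure is only a few lines, and a move list can be checked mechanically by replaying it. The sketch below is the standard algorithm plus a simple validator, not the paper's evaluation harness; the peg names and function names are my own choices.

```python
def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Return the list of (from_peg, to_peg) moves that solves n disks."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park n-1 disks on the spare peg
        + [(src, dst)]                       # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst)  # bring the n-1 disks on top of it
    )


def is_valid_solution(n, moves):
    """Replay moves from peg A and check the rules and the final state."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to move from this peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))
```

The solution has 2^n - 1 moves, so the output a model must emit without a single slip doubles with every added disk.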

Constraint satisfaction and combinatorics are where the search space is exponential and the techniques are not formalized (not enough data in the training set), so they remain hard for machines, as seen with Problem 6 of the IMO, which could not be solved by LLMs. I suspect there is an aspect of human intelligence that is not yet captured in LLMs.

[1] - https://machinelearning.apple.com/research/illusion-of-think...


energy123 306 days ago

Have you tried greedy decoding (temp 0) in AI Studio?

The temp 0.7-1.0 defaults are not designed for reconstructing context with perfect accuracy.
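To illustrate the difference, here is a toy sketch of temperature sampling over a list of logits, treating temperature 0 as plain argmax (greedy decoding). It is a simplified model of the idea, not any provider's actual decoding code; `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits at the given temperature."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With a nonzero temperature, a token that is only slightly less likely than the correct one still gets picked some of the time, and a long digit copy gives that chance many opportunities to occur.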


Kuinox 306 days ago

I always use the lowest temperature that I can input. But GPT-5 doesn't support a temperature setting. You'll get something like:

{ "error": { "message": "Unsupported value: 'temperature' does not support 0.0 with this model. Only the default (1) value is supported.", "type": "invalid_request_error", "param": "temperature", "code": "unsupported_value" } }
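One way a benchmark runner could cope with this is to read the error payload and retry without the rejected parameter. The sketch below only parses the JSON shape shown above; the request-parameter dictionary and the name `drop_unsupported_param` are hypothetical, and no real client library is called.

```python
import json


def drop_unsupported_param(request_params, error_body):
    """Return a copy of request_params without the parameter the API rejected.

    Assumes error_body is a JSON string shaped like the error above; any
    other error leaves the parameters unchanged.
    """
    err = json.loads(error_body).get("error", {})
    param = err.get("param")
    if err.get("code") == "unsupported_value" and param in request_params:
        return {k: v for k, v in request_params.items() if k != param}
    return request_params
```

The retry then runs at the model's default temperature, so results for such models are not directly comparable with greedy-decoded runs.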
