by trees101
69 days ago
From my reading, the official docs don’t support the strong
claim that frontier LLMs are explicitly RL-trained to “be lazy”
or conserve tokens, as this thread asserts. What they do document
is adaptive / hidden reasoning compute: OpenAI says reasoning
models allocate internal reasoning tokens and reasoning.effort
controls how many are used
(https://developers.openai.com/api/docs/guides/reasoning), and
Anthropic says adaptive thinking decides whether/how much to use
extended thinking based on request complexity, with effort as
soft guidance and max_tokens as the hard cap
(https://docs.anthropic.com/en/docs/build-with-claude/adaptiv...
hinking). So prompt wording may change how the same budget is
spent, but it can’t exceed the hard token cap.

Also, the “encouragement helps” anecdote seems real in the
AlphaEvolve workflow, but I can't see that for public
models. Gómez-Serrano says this in Quanta
(https://www.quantamagazine.org/the-ai-revolution-in-math-has...
rived-20260413/), and the released AlphaEvolve notebooks really
do contain prompts like “Good luck, I believe in you...”
(https://github.com/google-deepmind/alphaevolve_repository_of...
oblems, e.g.
https://github.com/google-deepmind/alphaevolve_repository_of...
blems/blob/main/experiments/finite_field_kakeya_problem/finite_f
ield_kakeya.ipynb). But those prompts also bundled strong
structural hints (“find a general solution”, “better
constructions are possible”), so from my reading the evidence
is: prompt phrasing matters, especially in an internal search
stack, but not “pep talks are a universal reasoning hack.”

Nothing I said contradicts this.
Here is the first attempt of what I'm testing. [0] Haiku can get the correct answer to `floor( (1234567 * 8901234) / 12345 )` or
```
Math.floor(
  (Math.floor(Math.random() * 9000000 + 1000000) *
    Math.floor(Math.random() * 9000000 + 1000000)) /
    Math.floor(Math.random() * 9000000 + 1000000)
)
```
Given this, Haiku gives a correct answer 77.8% of the time. Add or remove one digit and the accuracy is also highly predictable.
That is the WHOLE point. The models are predictable!
Given that prompt Sonnet at 37-digit × 37-digit (~10³⁷) never quits a predictable percentage of the time!
And, Opus at 80-digit × 80-digit simply quits after 9 seconds and 333 tokens!
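For anyone who wants to check answers at these sizes: plain `Math.floor` on JavaScript numbers is only exact while the product stays below 2^53, which holds for the 7-digit case but not for 37 or 80 digits. Here is a minimal sketch of a ground-truth generator using `BigInt`. The names `randomBigInt`, `exactFloor`, and `makeProblem` are hypothetical helpers for illustration and are not taken from the linked repository.

```javascript
// Hypothetical helpers for illustration; not from the linked repository.

// Random positive integer with exactly `digits` decimal digits, as a BigInt.
function randomBigInt(digits) {
  let s = String(1 + Math.floor(Math.random() * 9)); // leading digit 1-9
  for (let i = 1; i < digits; i++) {
    s += String(Math.floor(Math.random() * 10)); // remaining digits 0-9
  }
  return BigInt(s);
}

// Exact floor((a * b) / c) for positive BigInts.
// BigInt division truncates toward zero, which equals floor for positive operands.
function exactFloor(a, b, c) {
  return (a * b) / c;
}

// Build one problem string plus its exact answer.
function makeProblem(digits) {
  const a = randomBigInt(digits);
  const b = randomBigInt(digits);
  const c = randomBigInt(digits);
  return {
    prompt: `floor( (${a} * ${b}) / ${c} )`,
    answer: exactFloor(a, b, c),
  };
}

// The fixed example from above.
console.log(exactFloor(1234567n, 8901234n, 12345n).toString());

// One 37-digit problem with its exact answer.
const p = makeProblem(37);
console.log(p.prompt, "=", p.answer.toString());
```

Comparing a model's reply against `answer.toString()` over many generated problems gives the per-size success rate.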
This is the amazing thing people are not discussing. The models are very predictable.
The AI companies are not posting this information because it shows how unreliable the models are. However, I think there is great virtue in the models being consistently unreliable.
[0] https://github.com/adam-s/agent-tuning/blob/main/application...