by dataviz1000 61 days ago
> Anthropic says adaptive thinking decides whether/how much to use extended thinking based on request complexity, with effort as soft guidance and max_tokens as the hard cap

Nothing I said contradicts this.

Here is my first attempt at what I'm testing. [0] Haiku can get the correct answer to `floor( (1234567 * 8901234) / 12345 )` or

```
Math.floor(
  (Math.floor(Math.random() * 9000000 + 1000000) *
    Math.floor(Math.random() * 9000000 + 1000000)) /
    Math.floor(Math.random() * 9000000 + 1000000)
)
```

Given this prompt, Haiku gives the correct answer 77.8% of the time. Add or remove a digit and the result is just as predictable.
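A success rate like this can be measured with a small harness. The sketch below is not the code from [0]. `askModel` is a hypothetical callback standing in for whatever sends the prompt to the model and parses a number out of the reply, and the other function names are made up for illustration.

```javascript
// Random 7-digit integer in [1000000, 9999999], as in the expression above.
function randomSevenDigit(rng = Math.random) {
  return Math.floor(rng() * 9000000 + 1000000);
}

// Build one problem plus its ground truth. The product of two 7-digit
// numbers stays below Number.MAX_SAFE_INTEGER, so the multiplication is exact.
function makeProblem(rng = Math.random) {
  const a = randomSevenDigit(rng);
  const b = randomSevenDigit(rng);
  const c = randomSevenDigit(rng);
  return {
    prompt: `floor( (${a} * ${b}) / ${c} )`,
    expected: Math.floor((a * b) / c),
  };
}

// askModel is hypothetical: (prompt) => Promise<number | null>.
// Returns the fraction of trials where the model's answer matched exactly.
async function measureAccuracy(askModel, trials, rng = Math.random) {
  let correct = 0;
  for (let i = 0; i < trials; i++) {
    const { prompt, expected } = makeProblem(rng);
    const answer = await askModel(prompt);
    if (answer === expected) correct++;
  }
  return correct / trials;
}
```

Running `measureAccuracy` several times with the same trial count and comparing the returned fractions is one way to see how stable the rate is.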

That is the WHOLE point. The models are predictable!

Given that prompt at 37-digit × 37-digit (~10³⁷), Sonnet never quits in a predictable percentage of runs!

And Opus at 80-digit × 80-digit simply quits after 9 seconds and 333 tokens!
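At 37 or 80 digits the product no longer fits in a JavaScript Number, so checking the model's answer needs arbitrary-precision arithmetic. A minimal sketch using BigInt follows. The helper names are made up for illustration and are not taken from [0].

```javascript
// Random integer with exactly `digits` decimal digits, returned as a BigInt.
// Math.random is fine for generating test data; it is not cryptographically secure.
function randomBigInt(digits, rng = Math.random) {
  let s = String(Math.floor(rng() * 9) + 1); // leading digit 1-9
  for (let i = 1; i < digits; i++) {
    s += Math.floor(rng() * 10); // remaining digits 0-9
  }
  return BigInt(s);
}

// Exact floor((a * b) / c) for positive BigInts. BigInt division truncates
// toward zero, which equals floor when all operands are positive.
function exactFloorDiv(a, b, c) {
  return (a * b) / c;
}
```

For example, `exactFloorDiv(randomBigInt(37), randomBigInt(37), randomBigInt(37))` gives an exact reference value to compare against the model's output as a string.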

This is the amazing thing people are not discussing. The models are very predictable.

The AI companies are not posting this information because it shows how unreliable the models are. However, I think there is great virtue in the models being consistently unreliable.

[0] https://github.com/adam-s/agent-tuning/blob/main/application...

1 comment

Looks like you've done some thorough testing. Have you found that prompting reliably reduces premature quitting? And have you found that reducing premature quitting results in more accuracy?

Because these are probabilistic machines, they solve the same problem at a predictable rate. Even with different variables, the success rate stays consistent.
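One way to check that a rate really is consistent is to put a confidence interval on it and see whether later batches land inside. The sketch below uses the Wilson score interval. That choice is an assumption for illustration, not something [0] is known to use, and `wilsonInterval` is a made-up name.

```javascript
// Wilson score interval for a binomial proportion.
// successes and trials are counts; z = 1.96 gives roughly a 95% interval.
function wilsonInterval(successes, trials, z = 1.96) {
  if (trials <= 0) throw new RangeError("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z / denom) *
    Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials));
  return { low: center - half, high: center + half };
}
```

If two batches of runs on the same kind of problem produce intervals that overlap heavily, that is at least consistent with a stable underlying success rate.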

I only noticed the premature quitting issue recently and haven't tested it much yet. It's getting expensive to run Sonnet on hard multiplication problems. I let it run to 200k tokens and it still grinds without quitting.

But Opus has a different problem. Ask it to solve a Rubik's Cube and it will run for hours and never solve it. So there are definitely prompts that make it run forever. But if you tell it to break down multiplication using algorithms, it behaves differently. It can take really complicated calculus problems and break them into simpler ones. I can't stump it that way.

Here's the interesting thing. Even when Opus solves modular expressions by breaking them down the way it does with calculus, it still fails at a predictable rate. At any given level of complexity there's a constant failure rate, no matter what you do.

Models have a baseline failure rate that prompting can't change. You can change how they fail -- token burn or quitting early -- but the underlying limit stays the same.
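Separating how a run fails from whether it fails can be made explicit when logging results. The sketch below assumes each run is recorded as a hypothetical object `{ answer, expected, tokens, tokenCap }`. The field names and the rule "hit the cap means token burn, no answer below the cap means quitting early" are illustrative choices, not measurements or code from [0].

```javascript
// Classify one run. answer is null when the model produced no final answer.
function classifyRun({ answer, expected, tokens, tokenCap }) {
  if (answer !== null && answer === expected) return "correct";
  if (tokens >= tokenCap) return "token_burn"; // reached the cap without a right answer
  if (answer === null) return "quit_early"; // stopped below the cap with no answer
  return "wrong_answer"; // answered, but incorrectly
}

// Count outcomes across a batch of runs.
function tally(runs) {
  const counts = { correct: 0, token_burn: 0, quit_early: 0, wrong_answer: 0 };
  for (const run of runs) counts[classifyRun(run)]++;
  return counts;
}
```

Under the claim above, changing the prompt would move counts between `token_burn` and `quit_early` while the share of `correct` stays about the same.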