HN Mirror

by btown 2 hours ago

For Claude, at least, "throw out the reasoning tokens" is only true when a session has been idle for more than an hour, and is new since March.

The basic concept is that for a session active recently, interleaved thinking tokens are already in KV cache, so it's more efficient to keep using them than not! But when resuming an older session where KV cache has been evicted, it's more expensive to restore the thinking tokens, so they're silently dropped from prior turns. It's 2026 and stateful servers are back on the menu!

https://www.anthropic.com/engineering/april-23-postmortem describes this as an intended optimization:

> The design should have been simple: if a session has been idle for more than an hour, we could reduce users’ cost of resuming that session by clearing old thinking sections. Since the request would be a cache miss anyway, we could prune unnecessary messages from the request to reduce the number of uncached tokens sent to the API. We’d then resume sending full reasoning history. To do this we used the clear_thinking_20251015 API header along with keep:1.

> The implementation had a bug. Instead of clearing thinking history once, it cleared it on every turn for the rest of the session... This surfaced as the forgetfulness, repetition, and odd tool choices people reported.
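
A toy model of the gap between the intended design and the bug in the quotes above. Everything in it is hypothetical: the `Session` class, the `sticky_clear` flag, and the assumption that stripped thinking is also removed from the stored history are for illustration only. It is not Anthropic's implementation and does not call the real `clear_thinking_20251015` API; only the one-hour threshold and keep-one behavior come from the quote.

```python
from dataclasses import dataclass, field
from typing import Optional

IDLE_LIMIT_S = 3600  # "idle for more than an hour", per the quoted postmortem


@dataclass
class Message:
    role: str
    text: str
    thinking: Optional[str] = None  # reasoning attached to an assistant turn


def strip_old_thinking(messages, keep=1):
    """Copy `messages`, dropping thinking from all but the last `keep` turns that have it."""
    idx = [i for i, m in enumerate(messages) if m.thinking is not None]
    to_clear = set(idx[:-keep]) if keep else set(idx)
    return [
        Message(m.role, m.text, None if i in to_clear else m.thinking)
        for i, m in enumerate(messages)
    ]


@dataclass
class Session:
    messages: list = field(default_factory=list)
    last_active: float = 0.0
    sticky_clear: bool = False  # hypothetical flag, only used by the buggy variant

    def build_request(self, now, buggy=False):
        idle = now - self.last_active > IDLE_LIMIT_S
        if buggy:
            # Assumed shape of the bug: once set, the flag is never reset,
            # so every later turn clears thinking again.
            self.sticky_clear = self.sticky_clear or idle
            clear = self.sticky_clear
        else:
            clear = idle  # intended: clear once, on the idle resume only
        if clear:
            self.messages = strip_old_thinking(self.messages, keep=1)
        self.last_active = now
        return list(self.messages)


def thinking_count(msgs):
    return sum(m.thinking is not None for m in msgs)


def simulate(buggy):
    s = Session(last_active=0.0)
    s.messages = [
        Message("user", "q1"),
        Message("assistant", "a1", thinking="t1"),
        Message("user", "q2"),
        Message("assistant", "a2", thinking="t2"),
        Message("user", "q3"),
    ]
    first = s.build_request(now=4000.0, buggy=buggy)   # resume after > 1 hour idle
    s.messages.append(Message("assistant", "a3", thinking="t3"))
    s.messages.append(Message("user", "q4"))
    second = s.build_request(now=4010.0, buggy=buggy)  # session is active again
    return thinking_count(first), thinking_count(second)


print(simulate(buggy=False))  # (1, 2): thinking accumulates again after the resume
print(simulate(buggy=True))   # (1, 1): only the newest thinking ever survives
```

In this sketch the intended path loses old thinking once and then keeps new reasoning, while the buggy path keeps throwing away everything but the latest turn, which matches the "forgetfulness" described in the quote.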

And https://news.ycombinator.com/item?id=47879561 is a thread with a Claude team member's further rationale.

> Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.

(Also, https://news.ycombinator.com/item?id=47884517 indicates OpenAI drops reasoning tokens "smartly" at its own election, which is likely a similar performance optimization.)

I've experimented with rules to have Claude Code be explicit about recapping its thinking tokens, including tool choices and approaches chosen and rejected, into actual message output, but this is lossy at best. And sometimes dropping reasoning tokens can give a session "fresh eyes" in a good way.

I just really don't like the lack of control, and it's a reminder of how ephemeral the current landscape is. The Claude giveth, and the Claude taketh away.

3 comments

Roritharr 2 hours ago

Thank you! This is much more nuanced than my understanding so far!

8note 59 minutes ago

It's mostly annoying in that you give Opus a big job that should be able to run for hours on end, but instead it tries to stop and checkpoint at the earliest possible moment, even though the rest of the work is well specced and ready to go.

Then it waits out the hour and gets dumbed down.

chacham15 1 hour ago

I think you're confusing two different axes. There is a difference between the cache state and the context state.

Imagine a conversation with turns X, Y, and Z. When the LLM "reasons", it samples the next token A from P(A | X,Y,Z), then B from P(B | X,Y,Z,A), and so on, until it eventually produces a result D from P(D | X,Y,Z,A,B,C). For the next turn, instead of continuing the context from X,Y,Z,A,B,C,D, it continues from X,Y,Z,D, so you have P(N | X,Y,Z,D). This is what is meant by dropping the reasoning. It is done to save context space for the session.

This is a different thing than preserving the K/V state of P(N | X,Y,Z,D).
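
A small sketch of the two axes, using the same X, Y, Z / A, B, C / D labels. The helper names `build_context` and `cached_prefix_len` are made up for this example, and "cache" here is just a list of tokens whose K/V entries are assumed to be reusable when a new request shares the same prefix; real serving stacks are more involved.

```python
def build_context(turns, drop_reasoning):
    """Flatten turns into a token list. Each turn is (prompt, reasoning, answer)."""
    ctx = []
    for prompt, reasoning, answer in turns:
        ctx += prompt
        if not drop_reasoning:
            ctx += reasoning
        ctx += answer
    return ctx


def cached_prefix_len(cached, request):
    """Count leading tokens of `request` that match `cached` (reusable K/V entries)."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


turns = [(["X", "Y", "Z"], ["A", "B", "C"], ["D"])]

# What the server computed K/V for while generating the turn.
cached = build_context(turns, drop_reasoning=False)

# Context state: what the next request actually conditions on.
kept = build_context(turns, drop_reasoning=False)     # X Y Z A B C D
dropped = build_context(turns, drop_reasoning=True)   # X Y Z D

# Cache state: how much of the stored K/V the next request can reuse.
print(cached_prefix_len(cached, kept))     # 7
print(cached_prefix_len(cached, dropped))  # 3
```

Under these assumptions, dropping reasoning changes what the model conditions on and, separately, how much of the existing cache still lines up with the request.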

flaghacker 1 hour ago

No, I think the comment you're responding to is actually correct. Look at this quote from the Anthropic blog post again:

They clearly make the same distinction between the cache and the context. They're saying "we could reduce users’ cost of resuming that session by clearing old thinking sections". They intentionally made cached and uncached requests behave differently: specifically, they clear thinking sections from the context for requests that miss the cache.
