| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by 6keZbCECT2uB 51 days ago

"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"

This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.

The default thinking level seems more forgivable, but the churn in system prompts is something I'll need to figure out how to intentionally choose a refresh cycle.

7 comments

bcherny 51 days ago

Hey, Boris from the Claude Code team here.

Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.

The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.

We tried a few different approaches to improve this UX:

1. Educating users on X/social

2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.

Hope this is helpful. Happy to answer any questions if you have.

dbeardsl 51 days ago

I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

I feel like that is a choice best left up to users.

i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"

giwook 51 days ago

Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).

Perhaps if we were willing to pay more for our subscriptions Anthropic would be able to have longer cache windows but IDK one hour seems like a reasonable amount of time given the context and is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.

Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.

jimkleiber 51 days ago

I might be willing to pay more, maybe a lot more, for a higher subscription than claude max 20x, but the only thing higher is pay per token and i really dont like products that make me have to be that minutely aware of my usage, especially when it has unpredictability to it. I think there's a reason most telecoms went away from per minute or especially per MB charging. Even per GB, as they often now offer X GB, and im ok with that on phone but much less so on computer because of the unpredictability of a software update size.

Kinda like when restaurants make me pay for ketchup or a takeaway box, i get annoyed, just increase the compiled price.

giwook 50 days ago

For sure, I agree with that sentiment. It's interesting to consider the psychological component of that, like how "free shipping" is not really free, it's oftentimes just packaged into the price of the product but somehow it feels like we're getting a better deal.

I would not be surprised to see Anthropic, OpenAI etc head in the direction you mention as they mature and all of these datacenters currently undergoing construction come online in the next few years and drive down costs.

adam_patarino 50 days ago

Token anxiety is real mental overhead.

jimkleiber 50 days ago

That's the phrase i was looking for, thank you.

sharts 51 days ago

That doesn’t make sense to pay more for cache warming. Your session for the most part is already persisted. Why would it be reasonable to pay again to continue where you left off at any time in the future?

jeremyjh 51 days ago

Because it significantly increases actual costs for Anthropic.

If they ignored this then all users who don’t do this much would have to subsidize the people who do.

tikkabhuna 50 days ago

I’m coming at this as a complete Claude amateur, but caching for any other service is an optimisation for the company and transparent for the user. I don’t think I’ve ever used a service and thought “oh there’s a cache miss. Gotta be careful”.

I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.

danso 51 days ago

Genuine question: is the cost to keep a persistent warmed cache for sessions idling for hours/days not significant when done for hundreds of thousands of users? Wouldn’t it pose a resource constraint on Anthropic at some point?

tmountain 50 days ago

Related question, is it at all feasible to store cache locally to offload memory costs and then send it over the wire when needed?

cadamsdotcom 51 days ago

Sure, it wouldn’t make sense if they only had one customer to serve :)

uoaei 50 days ago

Exactly, even in the throes of today's wacky economic tides, storage is still cheap. Write the model state immediately after the N context messages in cache to disk and reload without extra inference on the context tokens themselves. If every customer did this for ~3 conversations per user you still would only need a small fraction of a typical datacenter to house the drives necessary. The bottleneck becomes architecture/topology and the speed of your buses, which are problems that have been contended with for decades now, not inference time on GPUs.

jeremyjh 50 days ago

This has nothing to do with the cost of storage. Surprisingly, you are not better informed than Anthropic on the subject of serving AI inference models.

A sibling comment explains:

https://news.ycombinator.com/item?id=47886200

PeterStuer 50 days ago

It may be persisted but it is not live in the inference engine.

Folcon 48 days ago

The reason I've been querying the 1 hour is a user's quota resets are often longer than that, as a result I've seen situations where someone builds a large context, then hits their quota limit, waits 2+ hours, their cache is gone, their first message then eats 20%+ of their current session quota and the user doesn't want to compact as they're still trying to get the model into a good understanding of the problem, this seems to be a really painful consequence for users on anything less than a max plan which seems like an unintended consequence of Anthropic's own system design choices?

IE How their quota and caching interact with each other, it doesn't make pro and max a little different, it makes it significantly different by unintentionally penalising pro users

JumpCrisscross 51 days ago

> I was never under the impression that gaps in conversations would increase costs

The UI could indicate this by showing a timer before context is dumped.

vyr 51 days ago

a countdown clock telling you that you should talk to the model again before your streak expires? that's the kind of UX i'd expect from an F2P mobile game or an abandoned shopping cart nag notification

abustamam 51 days ago

Well sure if you put it that way, they're similar. But it's either you don't see it and you get surprised by increased quota usage, or you do see it and you know what it means. Bonus points if they let you turn it off.

No need to gamify it. It's just UI.

thinkmassive 51 days ago

Plenty of room for a middle ground, like a static timestamp per session that shows expiration time, without the distraction of a constantly changing UI element.

matheusmoreira 51 days ago

Why not an automated ping message that's cheap for the model to respond to?

cortesoft 51 days ago

Because the cache is held on anthropics side, and they aren't going to hold your context in cache indefinitely.

karsinkk 51 days ago

Yes!! A UI widget that shows how far along on the prompt cache eviction timelines we are would be great.

vanviegen 50 days ago

That sounds stressful.

But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the kv-cache wouldn't need to be loaded back into VRAM for this. If that would work reliably, cache expiry timeouts could be more aggressive (5 min instead of an hour).

jimkleiber 51 days ago

I tried to hack the statusline to show this but when i tried, i don't think the api gave that info. I'd love if they let us have more variables to access in the statusline.

kiratp 51 days ago

By caching they mean “cached in GPU memory”. That’s a very very scarce resource.

Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.

Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)

libraryofbabel 50 days ago

Nit: It doesn’t have to live in GPU memory. The system will use multiple levels of caching and will evict older cached data to CPU RAM or to disk if a request hasn’t recently come in that used that prefix. The problem is, the KV caches are huge (many GB) and so moving them back onto the GPU is expensive: GPU memory bandwidth is the main resource constraint in inference. It’s also slow.

The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.

Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic and the usual intuitions about caching you might have from e.g. web development don't apply: a single cache blob for a single request is in the 10s of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache is internal model state; it's not your context or prompt or anything like that. Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it because it makes inference faster and cheaper. If you want to read up on this subject, be careful as a lot of blogs will tell you about the KV cache as it is used within inference for an single request (a critical detail concept in how LLMs work) but they will gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.

computably 51 days ago

> I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.

doesnt_know 51 days ago

How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

dlivingston 51 days ago

What is being discussed is KV caching [0], which is used across every LLM model to reduce inference compute from O(n^2) to O(n). This is not specific to Claude nor Anthropic.

[0]: https://huggingface.co/blog/not-lain/kv-caching

computably 51 days ago

> How do you do "due diligence" on an API that frequently makes undocumented changes and only publishes acknowledgement of change after users complain?

1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.

2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.

> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.

I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.

tempest_ 51 days ago

I use CC, and I understand what caching means.

I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.

libraryofbabel 50 days ago

They are caching internal LLM state, which is in the 10s of GB for each session. It's called a KV cache (because the internal state that is cached are the K and V matrices) and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.

hakanderyal 51 days ago

CC can explain it clearly, which how I learned about how the inference stack works.

fragmede 50 days ago

> 99.99% of users won't even understand the words that are being used.

That's a bad estimate. Claude Code is explicitly a developer shaped tool, we're not talking generically ChatGPT here, so my guess is probably closer to 75% of those users do understand what caching is, with maybe 30% being able to explain prompt caching actually is. Of course, those users that don't understand have access to Claude and can have it explain what caching is to them if they're interested.

solarkraft 51 days ago

I somewhat disagree that this is due diligence. Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

mpyne 51 days ago

> Claude Code abstracts the API, so it should abstract this behavior as well, or educate the user about it.

Does mmap(2) educate the developer on how disk I/O works?

At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.

websap 51 days ago

Does using print() in Python means I need to understand the Kernel? This is an absurd thought.

zem 51 days ago

mmap(2) and all its underlying machinery are open source and well documented besides.

computably 51 days ago

I would say this is abstracting the behavior.

margalabargala 51 days ago

Okay, sure. There's a dollar/intelligence tradeoff. Let me decide to make it, don't silently make Claude dumber because I forgot about a terminal tab for an hour. Just because a project isn't urgent doesn't mean it's not important. If I thought it didn't need intelligence I would use Sonnet or Haiku.

pixl97 50 days ago

"Gets mad because their is no option"

"Gets mad because when their is options the defaults suck"

"Gets mad because the options start massively increasing costs to areospace pricing"

margalabargala 50 days ago

Did you mean to reply to someone else? Or do you misunderstand the issue?

There is no option to avoid auto-dumbing after one hour of idle. I haven't complained about the cost at all, I'm happy to pay it.

So yeah, I'm mad because there's no option. The other two you mentioned don't apply.

someguyiguess 51 days ago

Yes. It’s perfectly reasonable to expect the user to know the intricacies of the caching strategy of their llm. Totally reasonable expectation.

jghn 51 days ago

To some extent I'd say it is indeed reasonable. I had observed the effect for a while: if I walked away from a session I noticed that my next prompt would chew up a bunch of context. And that led me to do some digging, at which point I discovered their prompt caching.

So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.

abustamam 51 days ago

> users should be curious and actively attempting to understand how it works

Have you ever talked with users?

> this is an endless job

Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.

trinsic2 50 days ago

Agreed. systems work the way they work. Its up to the user to determining what those limitations are. I don't like the concept of molding software based on every expectation a user has. Sometimes that expectation is unwarranted. You can see this in game development. Regardless of expressed criticism, sometimes gamers don't know what they want or what they need. A game should be developed by the design goals of the team, not cater to every whim the player base wants. We have seen were that can go.

coldtea 51 days ago

It's not like they have a poweful all-knowing oracle that can explain it to them at their dispos... oh, wait!

esafak 51 days ago

They have to know that this could bite them and to ask the question first.

exac 51 days ago

It is more useful to read posts and threads like this exact thread IMO. We can't know everything, and the currently addressed market for Claude Code is far from people who would even think about caching to begin with.

kang 51 days ago

It seems you haven't done the due diligence on what part of the API is expensive - constructing a prompt shouldn't be same charge/cost as llm pass.

coldtea 51 days ago

It seems you haven't done the due diligence on what the parent meant :)

It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.

It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.

kang 51 days ago

You not only skipped the diligence but confused everyone repeating what I said :(

that is what caching is doing. the llm inference state is being reused. (attention vectors is internal artefact in this level of abstraction, effectively at this level of abstraction its a the prompt).

The part of the prompt that has already been inferred no longer needs to be a part of the input, to be replaced by the inference subset. And none of this is tokens.

computably 51 days ago

I said "prompting with the entire context every time," I think it should be clear even to laypersons that the "prompting" cost refers to what the model provider charges you when you send them a prompt.

kovek 51 days ago

What if the cache was backed up to cold storage? Instead of having to recompute everything.

vanviegen 50 days ago

They probably already do that. But these caches can get pretty big (10s of GBs per session), so that adds up fast, even for cold storage.

kovek 50 days ago

10s of GBs? ( 1,000,000 context * 1,000 vector size ) ^ 2 = 1,000,000,000,000,000,000… oh wow.. I must be miscalculating

What about only storing the conversation and then recomputing the embeddings in the cache? Does that cost a lot? Doing a lot of matrix multiplication does not cost dollars of compute, especially on specialized hardware, right?

bontaq 51 days ago

How's that O(N^2)? How's it O(N) with caching? Does a 3 turn conversation cost 3 times as much with no caching, or 9 times as much?

jannyfer 51 days ago

I’m not sure that it’s O(N) with caching but this illustrates the N^2 part:

https://blog.exe.dev/expensively-quadratic

bontaq 51 days ago

If there was an exponential cost, I would expect to see some sort of pricing based on that. I would also expect to see it taking exponentially longer to process a prompt. I don't believe LLMs work like that. The "scary quadratic" referenced in what you linked seems to be pointing out that cache reads increase as your conversation continues?

If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?

raron 51 days ago

How big this cached data is? Wouldn't it be possible to download it after idling a few minutes "to suspend the session", and upload and restore it when the user starts their next interaction?

throwdbaaway 51 days ago

Should be about 10~20 GiB per session. Save/restore is exactly what DeepSeek does using its 3FS distributed filesystem: https://github.com/deepseek-ai/3fs#3-kvcache

With this much cheaper setup backed by disks, they can offer much better caching experience:

> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.

cyanydeez 51 days ago

I often see a local model QWEN3.5-Coder-Next grow to about 5 GB or so over the course of a session using llamacpp-server. I'd better these trillion parameter models are even worse. Even if you wanted to download it or offload it or offered that as a service, to start back up again, you'd _still_ be paying the token cost because all of that context _is_ the tokens you've just done.

The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.

cortesoft 51 days ago

What they mean when they say 'cached' is that it is loaded into the GPU memory on anthropic servers.

You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.

vanviegen 50 days ago

Wrong on both counts. The kv-cache is likely to be offloaded to RAM or disk. What you have locally is just the log of messages. The kv-cache is the internal LLM state after having processed these messages, and it is a lot bigger.

> upload and restore it when the user starts their next interaction

The data is the conversation (along with the thinking tokens).

There is no download - you already have it.

The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.

That is doable, but as Boris notes it costs lots of tokens.

vanviegen 50 days ago

You're quite confidently wrong! :-)

The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.

miroljub 51 days ago

This sounds like a religious cult priest blaming the common people for not understanding the cult leader's wish, which he never clearly stated.

computably 50 days ago

A strange view. The trade-off has nothing to do with a specific ideology or notable selfishness. It is an intrinsic limitation of the algorithms, which anybody could reasonably learn about.

Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."

miroljub 50 days ago

He was surprised because it was not clearly communicated. There's a lot of theory behind a product that you could (or could not) better understand, but in the end, something like price doesn't have much to do with the theoretical and practical behavior of the actual application.

cyanydeez 51 days ago

It'd probably be helpful for power users and transparency to actually show how the cache is being used. If you run local models with llamacpp-server, you can watch how the cache slots fill up with every turn; when subagents spawn, you see another process id spin up and it takes up a cache slot; when the model starts slowing down is when the context grows (amd 395+ around 80-90k) and the cache loads are bigger because you've got all that.

So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral because to keep all that cache active is computationally expensive because ...

You're still just running text through a extremely complex process, and adding to that text and to avoid re-calculation of the entire chain, you need the cache.

bede 50 days ago

I too would far rather bear a token cost than have my sessions rot silently beneath my feet. I usually have ~5 running CC sessions, some of which I may leave for a week or two of inactivity at a time.

lochnessduck 50 days ago

Yes, me too. This is good to know, but basically it means I can’t rely on old conversations any more. Using a “handoff” file to try and start a new conversation is effectively the same thing as what they did under the hood. So yeah, you can’t rely on old conversations to be as informed when you pick it back up.

airstrike 50 days ago

same here, and I suspect there are dozens of us

winternewt 50 days ago

Instead of just dropping all the context, the system could also run a compaction (summarizing the entire convo) before dropping it. Better to continue with a summary than to lose everything.

Folcon 50 days ago

There's problems with this approach as well I've found

I'm really beginning to feel the lack of control when it's comes to context if I'm being honest

bcherny 50 days ago

Yes! This is what we’re trying next.

nixpulvis 51 days ago

How else would you implement it?

btown 51 days ago

Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?

I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.

For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.

Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?

CjHuber 51 days ago

I think it’s crazy that they do this, especially without any notice. I would not have renewed my subscription if I knew that they started doing this.

Especially in the analysis part of my work I don‘t care about the actual text output itself most of the time but try to make the model „understand“ the topic.

In the first phase the actual text output itself is worthless it just serves as an indicator that the context was processed correctly and the future actual analysis work can depend on it. And they‘re… just throwing most the relevant stuff out all out without any notice when I resume my session after a few days?

This is insane, Claude literally became useless to me and I didn’t even know it until now, wasting a lot of my time building up good session context.

There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them… make it an env variable (that is announced not a secretly introduced one to opt out of something new!) or at least write it in a change log if they really don’t want to allow people to use it like before, so there‘d be chance to cancel the subscription in time instead of wasting tons of time on work patterns that not longer work

munk-a 51 days ago

Pointing at their terms of service will definitely be the instantly summoned defense (as would most modern companies) but the fact that SaaS can so suddenly shift the quality of product being delivered for their subscription without clear notification or explicitly re-enrollment is definitely a legal oversight right now and Italy actually did recently clamp down on Netflix doing this[1]. It's hard to define what user expectations of a continuous product are and how companies may have violated it - and for a long time social constructs kept this pretty in check. As obviously inactive and forgotten about subscriptions have become a more significant revenue source for services that agreement has been eroded, though, and the legal system has yet to catch up.

1. Specifically, this suite was about price increases without clear consideration for both parties - but the same justifications apply to service restrictions without corresponding price decreases.

https://fortune.com/2026/04/20/italian-court-netflix-refunds...

kiratp 51 days ago

OpenAI does this for all API calls

> Our systems will smartly ignore any reasoning items that aren’t relevant to your functions, and only retain those in context that are relevant. You can pass reasoning items from previous responses either using the previous_response_id parameter, or by manually passing in all the output items from a past response into the input of a new one.

https://developers.openai.com/api/docs/guides/reasoning

Disclosure - work on AI@msft

jetbalsa 51 days ago

So to defend a litte, its a Cache, it has to go somewhere, its a save state of the model's inner workings at the time of the last message. so if it expires, it has to process the whole thing again. most people don't understand that every message the ENTIRE history of the conversion is processed again and again without that cache. That conversion might of hit several gigs worth of model weights and are you expecting them to keep that around for /all/ of your conversions you have had with it in separate sessions?

3836293648 51 days ago

No? It's not because it's a cache, it's because they're scared of letting you see the thinking trace. If you got the trace you could just send it back in full when it got evicted from the cache. This is how open weight models work.

mpyne 51 days ago

The trace goes back fine, that's not the issue.

The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.

So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly on stripping parts of the context then it would lead to Claude getting all scatter-brained.

eknkc 51 days ago

I’m not familiar with the Claude API but OpenAI has an encrypted thking messages option. You get something that you can send back but it is encrypted. Not available on Anthropic?

reactordev 51 days ago

They are sending it back to the cache, the part you are missing is they were charging you for it.

CjHuber 51 days ago

No of course it’s unrealistic for them to hold the cache indefinitely and that’s not the point. You are keeping the session data yourself so you can continue even after cache expiry. The point I‘m making is that it made me very angry that without any announcement they changed behavior to strip the old thinking even when you have it in your session file. There is absolutely no reason to not ask the user about if they want this

And it’s part of a larger problem of unannounced changes it‘s just like when they introduced adaptive thinking to 4.6 a few weeks ago without notice.

Also they seem to be completely unaware that some users might only use Claude code because they are used to it not stripping thinking in contrast to codex.

Anyway I‘m happy that they saw it as a valid refund reason

rsfern 51 days ago

It seems like an opportunity for a hierarchical cache. Instead of just nuking all context on eviction, couldn’t there be an L2 cache with a longer eviction time so task switching for an hour doesn’t require a full session replay?

sfink 50 days ago

Living where? If it's in the GPU, then it's still taking up precious space that could be used for serving other sessions. If it's not in the GPU, then it doesn't help.

cyanydeez 51 days ago

what matters isn't that it's a cache; what matter is it's cached _in the GPU/NPU_ memory and taking up space from another user's active session; to keep that cache in the GPU is a nonstarter for an oversold product. Even putting into cold storage means they still have to load it at the cost of the compute, generally speaking because it again, takes up space from an oversold product.

FireBeyond 51 days ago

> There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them

The irony is that Claude Design does this. I did a big test building a design system, and when I came back to it, it had in the chat window "Do you need all this history for your next block of work? Save 120K tokens and start a new chat. Claude will still be able to use the design system." Or words to that effect.

CjHuber 51 days ago

This is exactly what also confused me. I had the exact same prompt in Claude code as well, and the no option implies you can also keep the whole history. But clicking keep apparently only ever kept the user and assistant messages not the whole actual thinking parts of the conversation

elAhmo 51 days ago

Don't you have that by just resuming old convo?

The only issue is that it didn't hit the cache so it was expensive if you resume later.

eknkc 51 days ago

Not at the moment apparently. They remove the thinking messages when you continue after 1 hour. That was the whole idea of that change. So the LLM gets all your messages, its responses etc but not the thinking parts, why it generated that responses. You get a lobotomised session.

elAhmo 51 days ago

OK didn't know that. I also resume fairly old sessions with 100-200k of context, and I sometimes keep them active for a while (but with large breaks in between).

Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.

tbrockman 51 days ago

Or generate tiny filler messages every hour until you come back to it.

trinsic2 51 days ago

Why cant you just build a project document that outlines that prompt that you want to do? Or have claude save your progress in memory so you can pick it up later? Thats what I do. It seems abhorrent to expect to have a running prompt that left idle for long periods of time just so you can pick up at a moments whim...

Terretta 51 days ago

You know that memory goes back into a prompt as context that wasn't cached, so... that just adds work.

Granted, the "memory" can be available across session, as can docs...

try-working 51 days ago

recursive-mode does just that: https://recursive-mode.dev/introduction

Terretta 51 days ago

This violates the principle of least surprise, with nothing to indicate Claude got lobotomized while it napped when so many use prior sessions as "primed context" (even if people don't know that's what they were doing or know why it works).

The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is it effectively "fine tunes" that session into a place your work product or questions are handled well.

// If this notion of sufficient context as fine tune seems surprising, the research is out there.)

Approaches tried need to deal with both of these:

1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.

2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.

uxcolumbo 51 days ago

I don't envy you Boris. Getting flak from all sorts of places can't be easy. But thanks for keeping a direct line with us.

I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate a bit more (i.e. not nice sending lawyers afters various devs without asking nicely first, banning accounts without notice, etc etc). Appreciate it's not easy to scale.

OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.

kuboble 51 days ago

As some others have mentioned.

I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session, not only the incremental question and answer.

(In understand under the hood that llms are n^2 by default but it's very counter intuitive - and given how popular cc is becoming outside of nerd circles, probably smaller and smaller fraction of users is aware of it)

I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.

a_t48 51 days ago

I got exactly this warning message yesterday, saying that it could use up a significant amount of my token budget if I resumed the conversation without compaction.

jhogendorn 51 days ago

Compaction wont save you, in fact calling compaction will eat about 3-5x the cold cache cost in usage ive found.

_flux 50 days ago

Wouldn't it help if the system did compaction before the eviction happens? But the problem is that Claude probably don't want to automatically compact all sessions that have been left idle for one hour (and very likely abandoned already), that would probably introduce even more additional costs.

Maybe the UI could do that for sessions that the user hasn't left yet, when the deadline comes near.

doubleunplussed 51 days ago

I saw that too, but that's actually even worse on cache - the entire conversation is then a cache miss and needs to be loaded in in order to do the compaction. Then the resulting compacted conversation is also a cache miss.

You ideally want to compact before the conversation is evicted from cache. If you knew you were going to use the conversation again later after cache expiry, you might do this deliberately before leaving a session.

Anthropic could do this automatically before cache expiry, though it would be hard to get right - they'd be wasting a lot of compute compacting conversations that were never going to be resumed anyway.

onemoresoop 51 days ago

Im glad they chose to do that as opposed to hidden behavior changes that only confuse users more.

fhub 51 days ago

Really good to know. That should have made it into their update letter in point (2). Empowering the user to choose is the right call.

skeledrew 51 days ago

> I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session

This feature has been live for a few days/weeks now, and with that knowledge I try remember to a least get a process report written when I'm for example close to the quota limit and the context is reasonably large. Or continue with a /compact, but that tends to lead to be having to repeat some things that didn't get included in the summary. Context management is just hard.

Terretta 51 days ago

Right, and reloading that context is the same cost as refilling the cache, so really, they're charging the same, and making it hard.

isaacdl 51 days ago

Thanks for giving more information. Just as a comment on (1), a lot of people don't use X/social. That's never going to be a sustainable path to "improve this UX" since it's...not part of the UX of the product.

It's a little concerning that it's number 1 in your list.

mtilsted 51 days ago

Then you need to update your documentation and teach claude to read the new documentation because here is what claude code answered:

Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message end the next one?

Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects cost/latency but not the response itself.

  The only things that can change output across a break: new context injected (like updated date), memory files being modified, or files on disk changing.

-- This answer directly contradict your post. It seems like the biggest problem is a total lack of documentation for expected behavior.

A similar thing happens if I ask claude code for the difference between plan mode, and accept edits on.

Then Claude told me the only difference was that with plan mode it would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work, and present it in a total different way. It is not just a "I will ask before applying changes" mode.

ryeguy 51 days ago

This isn't how LLMs work. They aren't self aware like this, they're trained on the general internet. They might have some pointers to documentation for certain cases, but they generally aren't going to have specialized knowledge of themselves embedded within. Claude code has no need to know about its own internal programming, the core loop is just javascript code.

CjHuber 51 days ago

It does have an built in documentation subagent it can invoke but that doesn’t help much if they don’t document their shenanigans

hennell 50 days ago

Don't be silly, they don't expect you to ask the Ai questions and get the right answers. Obviously if you want to know what's going on you should look at their first solution - check what advice they have posted on X...

jwr 50 days ago

These controversies erupt regularly, and I hope that you will see a common thing with most of them: you make a decision for your users without informing them.

Please fight this hubris. Your users matter. Many of us use your tools for everyday work and do not appreciate having the rug pulled from under them on a regular basis, much less so in an underhanded and undisclosed way.

I don't mind the bugs, these will happen. What I do not appreciate is secretly changing things that are likely to decrease performance.

Kiro 50 days ago

A company that needs to anchor every single thing with the users will create a stale product.

jwr 50 days ago

That is not what I wrote. The phrases "without informing them", "in an underhanded and undisclosed way" and "secretly changing things" were important. I'm all for product evolution, but users should be informed when the product is changed, especially when the change can be for the worse (like dumbing down the model).

salawat 50 days ago

I've spent my entire working career dealing with companies that do the opposite. The product still goes stale. Find a better excuse.

You're acquiring users as a recurring revenue source. Consider stability and transparency of implementation details cost of doing business, or hemorrhage users as a result.

tomaskafka 50 days ago

While I hate all the gaslighting Anthropic seems to do recently (and the fact that their harness broke the code quality, while they forbid use of third party harnesses), making decisions for users is what UX is.

See also the difference between eg. MacOS (with large M, the older good versions) and waiting for "Year of linux on desktop".

I don't think the issue is making decisions for users, but trying to switch off the soup tap in the all-you-can-eat soup bar. Or, wrong business model setting wrong incentives to both sides.

saadn92 51 days ago

I leave sessions idle for hours constantly - that's my primary workflow. If resuming a 900k context session eats my rate limit, fine, show me the cost and let me decide whether to /clear or push through. You already show a banner suggesting /clear at high context - just do the same thing here instead of silently lobotomizing the model.

sdevonoes 51 days ago

So if they fuck it up again and now they have, let’s say, “db problems” instead of “caching problems”, you would happily simply pay more? Wtf

saadn92 51 days ago

No, I wouldn't. I'd like some transparency at least.

albedoa 51 days ago

Did you reply to the wrong comment? I don't see that implied here at all. What?

ceuk 51 days ago

Is having massive sessions which sit idle for hours (or days) at a time considered unusual? That's a really, really common scenario for me.

Two questions if you see this:

1) if this isn't best practice, what is the best way to preserve highly specific contexts?

2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?

hedgehog 51 days ago

Have the tool maintain a doc, and use either the built-in memory or (I prefer it this way) your own. I've been pretty critical of some other aspects of how Claude Code works but on this one I think they're doing roughly the right thing given how the underlying completion machinery works.

Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.

jetbalsa 51 days ago

The cache is stored on Antropics servers, since its a save state of the LLM's weights at the time of processing. its several gigs in size. Every SINGLE TIME you send a message and its a cache miss you have to reprocess the entire message again eating up tons of tokens in the process

cyanydeez 51 days ago

clarification though: the cache that's important to the GPU/NPU is loaded directly in the memory of the cards; it's not saved anywhere else. They could technically create cold storage of the tokens (vectors) and load that, but given how ephemeral all these viber coders are, it's unlikely there's any value in saving those vectors to load in.

So then it comes to what you're talking about, which is processing the entire text chain which is a different kind of cache, and generating the equivelent tokens are what's being costed.

But once you realize the efficiency of the product in extended sessions is cached in the immediate GPU hardware, then it's obvious that the oversold product can't just idle the GPU when sessions idle.

fidrelity 51 days ago

Just wanted to say I appreciate your responses here. Engaging so directly with a highly critical audience is a minefield that you're navigating well.

Thank you.

qsort 51 days ago

I agree with this.

I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.

Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.

troupo 51 days ago

> Engaging so directly with a highly critical audience is a minefield that you're navigating well.

They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.

All the while all the official channels refused to acknowledge any problems.

Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.

rob 51 days ago

Examples of gaslighting on April 15th (the first 2 issues were "fixed" by April 10th according to the story):

https://x.com/bcherny/status/2044291036860874901 https://x.com/bcherny/status/2044299431294759355

No mention of anything like "hey, we just fixed two big issues, one that lasted over a month." Just casual replies to everybody like nothing is wrong and "oh there's an issue? just let us know we had no idea!"

troupo 51 days ago

Don't forget "our investigation concluded you are to blame for using the product exactly as advertised" https://x.com/lydiahallie/status/2039800718371307603 including gems like "Sonnet 4.6 is the better default on Pro. Opus burns roughly twice as fast. Switch at session start"

shimman 51 days ago

Very easy to do when you stand to make tens of millions when your employer IPOs. Let's not maybe give too much praise and employ some critical thinking here.

simplify 51 days ago

What is the purpose of this mindset? Should we encourage typical corporate coldness instead?

sdevonoes 51 days ago

We should encourage minimal dependency on multibillion tech companies like anthropic. They, and similar companies are just milking us… but since their toys are soo shiny, we don’t care

simplify 51 days ago

Sure, but that seems out of scope of the original comment.

hgoel 51 days ago

Is "employ some critical thinking" supposed to involve being an annoying uptight cynic?

artdigital 51 days ago

I'm also a Claude Code user from day 1 here, back from when it wasn't included in the Pro/Max subscriptions yet, and I was absolutely not aware of this either. Your explanation makes sense, but I naively was also under the impression that re-using older existing conversations that I had open would just continue the conversation as is and not be a treated as a full cache miss.

My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.

This cache information should probably get displayed somewhere within Claude Code

bcherny 51 days ago

Yep, agree. We added a little "/clear to save XXX tokens" notice in the bottom right, and will keep iterating on this. Thanks for being an early user!

Implicated 51 days ago

But.. that doesn't solve the problem of having no indication in-session when it'll lose the cache. A nudge to /clear does nothing to indicate "or else face significant cost" nor does it indicate "your cache is stale".

Love the product. <3

troupo 50 days ago

Instead of showing actual usage, costs and cache status you spent two months denying the issue even exists, making the product silently worse, and now you're "iterating on this"

troupo 50 days ago

To add to this. The new indicator is "New task? /clear to save <X> tokens" even though it affects all tasks, not just new ones.

Mislead, gaslight, misdirect is the name of the game

troupo 51 days ago

> We tried a few different approaches to improve this UX: 1. Educating users on X/social

No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X

Terretta 51 days ago

There's a cultural divide between SV and the 85% of SMB using M365, for example. When everyone you know uses a thing, I mean, who doesn't?*

There's a reason live service games have splash banners at every login. No matter what you pick as an official e-coms channel, most of your users aren't there!

* To be fair, of all these firms, ANTHROP\C tries the hardest to remember, and deliver like, some people aren't the same. Starting with normals doing normals' jobs.

bobkb 51 days ago

Resuming sessions after more than 1 hour is a very common workflow that many teams are following. It will be great if this is considered as an expected behaviour and design the UX around it. Perhaps you are not realising the fact that Claude code has replaced the shells people were using (ie now bash is replaced with a Claude code session).

gib444 50 days ago

> Resuming sessions after more than 1 hour is a very common workflow that many teams are following

Yeah it's called lunch!

trinsic2 51 days ago

I think thats a bad idea. It seems like expecting to have a prompt open like this, accumulating context puts a load on the back end. Its one of those things that is a bad habit. Like trying to maintain open tabs in a browser as a way to keep your work flow up to date when what you really should be doing is taking notes of your process and working from there.

I have project folders/files and memory stored for each session, when I come back to my projects the context is drawn from the memory files and the status that were saved in my project md files.

Create a better workflow for your self and your teams and do it the right way. Quick expect the prompt to store everything for you.

For the Claude team. If you havent already, I'd recommend you create some best practices for people that don't know any better, otherwise people are going to expect things to be a certain way and its going to cause a lot of friction when people cant do what the expect to be able to do.

troupo 50 days ago

> I think thats a bad idea. It seems like expecting to have a prompt open like this, accumulating context puts a load on the back end

Let's see what Boris Cherny himself and other Anthropic vibe-coders say about this:

https://x.com/bcherny/status/2044847849662505288

Opus 4.7 loves doing complex, long-running tasks like deep research, refactoring code, building complex features, iterating until it hits a performance benchmark.

https://x.com/bcherny/status/2007179858435281082

For very long-running tasks, I will either (a) prompt Claude to verify its work with a background agent when it's done... so Claude can cook without being blocked on me.

https://x.com/trq212/status/2033097354560393727

Opus 4.6 is incredibly reliable at long running tasks

https://x.com/trq212/status/2032518424375734646

The long context window means fewer compactions and longer-running sessions. I've found myself starting new sessions much less frequently with 1 million context.

https://x.com/trq212/status/2032245598754324968

I used to be a religious /clear user, but doing much less now, imo 4.6 is quite good across long context windows

---

I could go on

kiratp 51 days ago

Agents making forward progress hours apart is an expected pattern and inference engines are being adapted to serve that purpose well.

It’s hard to do it without killing performance and requires engineering in the DC to have fast access to SSDs etc.

Disclosure: work on ai@msft. Opinions my own.

kccqzy 51 days ago

This just does not match my workflow when I work on low-priority projects, especially personal projects when I do them for fun instead of being paid to do them. With life getting busy, I may only have half an hour each night with Claude to make some progress on it before having to pause and come back the next day. It’s just the nature of doing personal projects as a middle-aged person.

The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.

ryanisnan 51 days ago

Why does the system work like that? Is the cache local, or on Claude's servers?

Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.

jetbalsa 51 days ago

The cache is on Antropics server, its like a freeze frame of the LLM inner workings at the time. the LLM can pick up directly from this save state. as you can guess this save state has bits of the underlying model, their secret sauce. so it cannot be saved locally...

dicethrowaway1 51 days ago

Maybe they could let users store an encrypted copy of the cache? Since the users wouldn't have Anthropic's keys, it wouldn't leak any information about the model (beyond perhaps its number of parameters judging by the size).

jetbalsa 51 days ago

I'm unsure of the sizes needed for prompt cache, but I suspect its several gigs in size (A percentage of the model weight size), how would the user upload this every time they started a resumed a old idle session, also are they going to save /every/ session you do this with?

im3w1l 51 days ago

A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that doesn't expire after any time. Use it for their most recent long chat (a very short question-answer that could easily be replayed shouldn't evict a long convo).

skissane 51 days ago

They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.

northern-lights 51 days ago

Encryption can only ensure the confidentiality of a message from a non-trusted third party but when that non-trusted third party happens to be your own machine hosting Claude Code, then it is pointless. You can always dump the keys (from your memory) that were used to encrypt/decrypt the message and use it to reconstruct the model weights (from the dump of your memory).

dicethrowaway1 51 days ago

jetbalsa said that the cache is on Anthropic's server, so the encryption and decryption would be server-side. You'd never see the encryption key, Anthropic would just give you an encrypted dump of the cache that would otherwise live on its server, and then decrypt with their own key when you replay the copy.

iidsample 51 days ago

We at UT-Austin have done some academic work to handle the same challenge. Will be curious if serving engines could modified. https://arxiv.org/abs/2412.16434 .

The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!

bshanks 50 days ago

The main issue here is not UX, but rather that you did something which degraded quality without transparency. You should have documented this and also highlighted the change in an announcement. There should never be an undocumented change that reduces quality. There should never be something the user can do (or fail to do) that reduces quality without that being documented. To regain trust, Anthropic should make an announcement committing to documenting/announcing any future intentional quality-reducing changes.

In addition, the following is less important, but as other commenters have stated: walking away from a conversation and coming back to it more than an hour later is very common and it would be nice if there were a way for the user to opt to retain maximum quality (e.g. no dropped thinking) in this case. In the longer term, it would be nice if there were a way for the user to wait a few minutes for a stale session to resume, in exchange for not having a large amount of quota drained (ie have a 'slow mode' invoked upon session resumption that consumes less quota).

Joeri 51 days ago

This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?

kivle 51 days ago

I agree.. Maybe parts of the cache contents are business secrets.. But then store a server side encrypted version on the users disk so that it can be resumed without wasting 900k tokens?

slashdave 51 days ago

Disk where? LLM requests are routed dynamically. You might not even land in the same data center.

FuckButtons 51 days ago

But if you have a tiered cache, then waiting several seconds / minutes is still preferable to getting a cache miss. I suspect the larger problem is the amount of tinkering they are doing with the model makes that not viable.

8note 51 days ago

reasonably, if i'm in an interactive session, its going to have breaks for an hour or more.

whats driving the hour cache? shouldnt people be able to have lunch, then come back and continue?

are you expecting claude code users to not attend meetings?

I think product-wise you might need a better story on who uses claude-code, when and why.

Same thing with session logs actually - i know folks who are definitely going to try to write a yearly RnD report and monthly timesheets based on text analysis of their claude code session files, and they're going to be incredibly unhappy when they find out its all been silently deleted

FuckButtons 51 days ago

As with everything Anthropic recently this is a supply constraint issue. They have not planned for scale adequately.

QuantumGood 50 days ago

Prioritize outcomes for users using your product. That should lead to improving the viral/visibility aspect of documentation notification, as well as other aspects of documentation. Make this a differentiator of your product. Widespread misperceptions hurt outcomes.

Could you create one location educating advanced users, and:

• Promote, Organize and Maintain it

• Develop a group of users that have early access to "upcoming notifications we're working on"

• Perhaps give a third party specializing in making information visible responsibility for it

• Read comments by users in various places to determine what should be communicated. Just under this comment @dbeardsl begins "I appreciate the reply, but I was never under the impression that ...".

The speed that key users are informed of issues is critical. This is just off the top of my head, a much better plan I'm sure could be created.

ohcmon 51 days ago

Boris, wait, wait, wait,

Why not use tired cache?

Obviously storage is waaay cheaper than recalculation of embeddings all the way from the very beginning of the session.

No matter how to put this explanation — it still sounds strange. Hell — you can even store the cache on the client if you must.

Please, tell me I’m not understanding what is going on..

otherwise you really need to hire someone to look at this!)

krackers 51 days ago

Same question I had in https://news.ycombinator.com/item?id=47819914

I still don't understand it, yes it's a lot of data and presumably they're already shunting it to cpu ram instead of keeping it on precious vram, but they could go further and put it on SSD at which point it's no longer in the hotpath for their inference.

rkuska 51 days ago

I don't think you can store the cache on client given the thinking is server side and you only get summaries in your client (even those are disabled by default).

sargunv 51 days ago

If they really need to guard the thinking output, they could encrypt it and store it client side. Later it'd be sent back and decrypted on their server.

But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.

solarkraft 51 days ago

I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge - that’s why it’s impractical to transfer to/from the client. It would also allow figuring out a lot about the underlying model, though I guess you could encrypt it.

What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.

tonyarkles 51 days ago

Just to contextualize this... https://lmcache.ai/kv_cache_calculator.html. They only have smaller open models, but for Qwen3-32B with 50k tokens it's coming up with 7.62GB for the KV cache. Imagining a 900k session with, say, Opus, I think it'd be pretty unreasonable to flush that to the client after being idle for an hour.

2001zhaozhao 51 days ago

I wonder whether prompt caches would be the perfect use case of something like Optane.

It's kept for long enough that it's expensive to store in RAM, but short enough that the writes are frequent and will wear down SSD storage

ohcmon 51 days ago

Yes — encryption is the solution for client side caching.

But even if it’s not — I can’t build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier

the-grump 51 days ago

That is understandable, but the issue is the sudden drop in quality and the silent surge in token usage.

It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.

toephu2 51 days ago

How does the Claude team recommend devs use Claude Code?

1) Is it okay to leave Claude Code CLI open for days?

2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?

BoppreH 51 days ago

Isn't that exactly what people had been accusing Anthropic of doing, silently making Claude dumber on purpose to cut costs? There should be, at minimum, a warning on the UI saying that parts of the context were removed due to inactivity.

frumplestlatz 51 days ago

The entire reason I keep a long-lived session around is because the context is hard-won — in term of tokens and my time.

Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.

I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted 10s of hours of my time, and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.

I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.

I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.

deaux 51 days ago

> The entire reason I keep a long-lived session around is because the context is hard-won — in term of tokens and my time. Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.

Hard agree, would like to see a response to this.

8note 51 days ago

as a variation:

how does this help me as a customer? if i have to redo the context from scratch, i will pay both the high token cost again, but also pay my own time to fill it.

the cost of reloading the window didnt go away, it just went up even more

FireBeyond 51 days ago

> I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.

I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.

andrewingram 50 days ago

This points to a fairly fundamental mismatch between the realities of running an LLM and the expectations of users. As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later. The fact that there is a difference, means it's now being compensated for in fairly awkward ways -- none of the solutions seem good, just varying degrees of bad.

Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?

bavell 50 days ago

> As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later.

As an informed user who understands his tools, I of course expect large uncached conversations to massively eat into my token budget, since that's how all of the big LLM providers work. I also understand these providers are businesses trying to make money and they aren't going to hold every conversation in their caches indefinitely.

andrewingram 50 days ago

I'd hazard a guess that there's a large gulf between proportion of users who know as much as you, and the total number using these tools. The fact that a message can perform wildly differently (in either cost, or behaviour if using one of the mitigations) based on whether I send it at t vs t+1 seems like a major UX issue, especially given t is very likely not exposed in the UI.

bavell 49 days ago

I definitely agree that it should be shown and obvious in the UI. They do show a warning now when resuming old sessions but still could be better.

gverrilla 51 days ago

I drop sessions very frequently to resume later - that's my main workflow with how slow Claude is. Is there anything I can do to not encounter this cache problem?

try-working 51 days ago

You created this issue by setting a timer for cache clearing. Time is really not a dimension that plays any role in how coding agent context is used.

willsmith72 51 days ago

Wow so that's why you did #2? The explanation in the CLI is really not clear. I thought it was just a suggestion to compact, no idea it was way more expensive than if I hadn't left it idle for an hour.

You guys really need to communicate that better in the CLI for people not on social

dnnddidiej 51 days ago

It is too suprising. Time passed should not matter for using AI.

Either swallow the cost or be transparent to the user and offer both options each time.

Confiks 51 days ago

So you made this change completely invisible to the user, without the user being able to choose between the two behaviors, and without even documenting it in the (extremely verbose) changelog [1]? I can't find it, the Docs Assistant can't find it (well, it "I found it!" three times being fed your reply with a non-matching item).

I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.

In my eyes, Claude Code is mainly a context management tool. I build a foundation of apparent understanding of the problem domain, and then try to work towards a solution in a dialogue. Now you tell me Anthrophic has been silently breaking down that foundation without telling me, wasting potentially hours of my time.

It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.

[1] https://code.claude.com/docs/en/changelog

6keZbCECT2uB 49 days ago

I can see how this makes sense as a default behavior for cost conscious users. I would prefer to have the option for my company to pay more to rehydrate the cache than to have there be a model performance difference when having idled for an hour.

"We tried a few different approaches to improve this UX:

1. Educating users on X/social

2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post."

I see how these interventions help users reduce their token burn rate, but they don't address the need for an enterprise user to maintain quality.

A common workflow for me is kick off a prompt, commute home, eat dinner, follow up on prompt. Frequently 80K tokens or less in the context, frequently > 3 hours. Or when running multiple sessions it's easy to let a session idle for a few hours while I focus on one. Or many meetings might mean idle time for an hour.

Also, for enterprise users, I don't think education on X is a great place. There are people upskilling on this that never intentionally go on X.

First thing that comes to mind would be a weekly tip feed of footguns and underutilized functionality published to an anthropic website. "The Old New Thing" "Guru of the Week" "Abseil tips of the week" all have that format.

looshch 50 days ago

> We tried a few different approaches to improve this UX

how about acknowledging that you fucked up your own customers’ money and making a full refund for the affected period?

> Educating users on X/social

that is beyond me

ты не Борис, ты максимум борька

winternewt 50 days ago

> Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?

To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.

mandeepj 51 days ago

> that would be >900k tokens written to cache all at once

Probably that's why I hit my weekly limits 3-4 days ago, and was scheduled to reset later today. I just checked, and they are already reset.

Not sure if it's already done, shouldn't there be a check somewhere to alert on if an outrageous number of tokens are getting written, then it's not right ?

nhinck3 50 days ago

So is it for latency or is it for cost?

Why did you lie 11 days ago, 3 days after the fix went in, about the cause of excess token usage?

bmitc 49 days ago

Appreciate the responses here. However, I feel like these responses are just to show us how much you know about the product and aren't actually helpful.

Instead, why don't you and Anthropic be more open about changes to these tools rather than waiting for users to complain, then investigating things after the fact that you should have investigated in the first place, and then posting on social media about all the cool tech details?

My company is tens of thousands strong. The amount of churn in Claude Code is a major issue and causing real awareness of the lack of stability and lack of customer support Anthropic provides.

And Claude Code is actually becoming a prototypical example of the dangers of vibe coded products and the burdens they place.

r00t- 50 days ago

We hit limits, and we come back when the limit is lifted. Isn't it obvious sessions are going to stay idle for more than 1 hour when Claude itself is hitting the limits?

I switched to Codex, Claude has gotten to a point where it's just unusable for the regular Joe.

arcza 51 days ago

You need to seriously look at your corporate communications and hire some adults to standarise your messaging, comms and signals. The volatility behind your doors is obvious to us and you'd impress us much more if you slowed down, took a moment to think about your customers and sent a consistent message.

You lost huge trust with the A/B sham test. You lost trust with enshittification of the tokenizer on 4.6 to 4.7. Why not just say "hey, due to huge input prices in energy, GPU demand and compute constraints we've had to increase Pro from $20 to $30." You might lose 5% of customers. But the shady A/B thing and dodgy tokenizer increasing burn rate tells everyone inc. enterprise that you don't care about honesty and integrity in your product.

I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.

0123456789ABCDE 50 days ago

2. could you bring back the _compact and accept plan_? even if it is not the default option.

fydorm 50 days ago

Add this to your `settings.json`:

"showClearContextOnPlanAccept": true,

infogulch 51 days ago

How big is the cache? Could you just evict the cache into cheap object storage and retrieve it when resuming? When the user starts the conversation back up show a "Resuming conversation... ⭕" spinner.

albert_e 50 days ago

> The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users.

I dont agree with this being characterized as a "corner case".

Isn't this how most long running work will happen across all serious users?

I am not at my desk babysitting a single CC chat session all day. I have other things to attend to -- and that was the whole point of agentic engineering.

Dont CC users take lunch breaks?

How are all these utterly common scenarios being named as corner cases -- as something that is wildly out of the norm, and UX can be sacrificed for those cases?

cowlby 50 days ago

Ahh that makes sense. Sometimes it's convenient to re-use an older conversation that has all the context I need. But maybe it's just the last 20% that's relevant.

It would be nice to be able to summarize/cut into a new leaner conversation vs having to coax all the context back into a fresh one. Something like keep the last 100,000 tokens.

I believe /compact achieves something like this? It just takes so long to summarize that it creates friction.

Folcon 50 days ago

Hi Boris

I'm curious why 1 hour was chosen?

Is increasing it a significant expense?

Ever since I heard about this behaviour I've been trying to figure out how to handle long running Claude sessions and so far every approach I've tried is suboptimal

It takes time to create a good context which can then trigger a decent amount of work in my experience, so I've been wondering how much this is a carefully tuned choice that's unlikely to change vs something adjustable

chris1993 51 days ago

So this explains why resuming a session after a 5-hour timeout basically eats most of the next session. How then to avoid this?

chid 50 days ago

Just curious, is there a consolidated list of all these "education" tips?

Intuitively I understand this due to how context windows work and you're looking to increase cache hits, has Anthropic tried compact/summarise on idle as a configurable option? Seems to have decent tradeoffs + education in a setting.

airstrike 51 days ago

Why is time the variable you're solving for? Why can't I keep that cache warm by keeping the session open?

nextaccountic 51 days ago

what about selling long term cache space to users?

or even, let the user control the cache expiry on a per request basis. with a /cache command

that way they decide if they want to drop the cache right away , or extend it for 20 hours etc

it would cost tokens even if the underlying resource is memory/SSD space, not compute

samusiam 50 days ago

For idle sessions I would MUCH rather pay the cost in tokens than reduced quality. Frankly, it's shocking to me that you would make that trade-off for users without their knowledge or consent.

taspeotis 50 days ago

Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?

FuckButtons 51 days ago

From a utility perspective using a tiered cache with some much higher latency storage option for up to n hours would be very useful for me to prevent that l1 cache miss.

noname120 50 days ago

Why not automatically run a compaction close to the 1-hour mark? Then the cache miss won’t have such a bad impact.

jorjon 51 days ago

What about:

/loop 5m say "ok".

Will that keep the cache fresh?

tripzilch 49 days ago

I don't think it's fair or reasonable to charge your cache misses to the user.

PeterStuer 50 days ago

At least for me, option 2 seems far favorable to the others. Give me the info, then let me decide.

useyourforce 51 days ago

I actually have a suggestion here - do not hide token count in non-verbose mode in Claude Code.

growt 51 days ago

Wasn’t cache time reduced to 5 minutes? Or is that just some users interpretation of the bug?

sockaddr 51 days ago

Sorry but I think this should be left up to the user to decide how it works and how they want to burn their tokens. Also a countdown timer is better than all of these other options you mention.

foobarbecue 50 days ago

Hi Boris! Wanted to let you know that I find those ads with you saying "now when you code, you use an agent" obnoxious because of that incorrect statement. I have no interest in slop coding. I find it way more ergonomic and effective to use code to tell a machine precisely what to do than to use English to tell it vaguely. I hate that your ad is misleading so many non-coders, who will actually believe your lie that nobody codes anymore. Probably doesn't help that YouTube was playing it as an interruption in every video I watched. I probably saw it 100 times and was getting to the "throw the remote at the tv" stage XD.

baq 50 days ago

maybe you could surface an expected cache miss to the user

kang 51 days ago

> tokens written to cache all at once, which would eat up a significant % of your rate limits

Construction of context is not an llm pass - it shouldn't even count towards token usage. The word 'caching' itself says don't recompute me.

Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?

Majromax 50 days ago

> Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?

Input tokens are expensive, since the whole model has to be run for each token. They're cheaper than output tokens because the model doesn't need to run the sampler, so some pipeline parallelism is possible, but on the other hand without caching the input token cost would have to be paid anew for each output token.

Prompt caching fixes that O(N^2) cost, but the cache itself is very heavyweight. It needs one entry per input token per model layer, and each entry is an O(1000)-dimensional vector. That carries a huge memory cost (linear in context length), and when cached that means the context's memory space is no longer ephemeral.

That's why a 'cache write' can carry a cost; it is the cost of both processing the input and committing the backing store for the cache duration.

tadfisher 51 days ago

It astounds me that a company valued in the hundreds-of-billions-of-dollars has written this. One of the following must be true:

1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.

2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.

someguyiguess 51 days ago

It’s definitely a cost / resource saving strategy on their end.

raincole 51 days ago

It's very weird that they frame caching as "latency reduction" when it comes to a cloud service. I mean, yes, technically it reduces latency, but more importantly it reduces cost. Sometimes it's more than 80% of the total cost.

I'm sure most companies and customers will consider compromising quality for 80% cost reduction. If they just be honest they'll be fine.

retinaros 51 days ago

they just vibecoded a fix and didnt think about the tradeoff they were making and their always yes-man of a model just went with it

adam_patarino 50 days ago

It’s certainly #2. They have shown over dozens of decisions they move very quickly, break stuff, then have to both figure out what broke and how to explain it.

sekai 50 days ago

The same company that claims they have models that are too "dangerous" to release btw.

billywhizz 51 days ago

what's even more amazing is it took them two weeks to fix what must have been a pretty obvious bug, especially given who they are and what they are selling.

sockaddr 51 days ago

Yeah this is actually quite shocking. In my earlier uses of CC I might noodle on a problem for a while, come back and update the plan, go shower, think, give CC a new piece of advice, etc. Basically treating it like a coworker. And I thought that it was a static conversation (at least on the order of a day or so). An hour is absurd IMO and makes me want to rethink whether I want to keep my anthropic plan.

seizethecheese 51 days ago

It's also a bit of a fishy explanation for purging tokens older than an hour. This happens to also be their cache limit. I doubt it is incidental that this change would also dramatically drop their cost.

cma 51 days ago

They moved it to 5m around the same timeframe though: https://www.reddit.com/r/ClaudeAI/comments/1sk3m12/followup_...

zmmmmm 51 days ago

Seems like it would interact very badly with the time based usage reset. If lots of people are hitting their limit and then letting the session idle until they can come back, this wouldn't be an exception. It would almost be the default behaviour.

Aperocky 51 days ago

Wow, I always thought the context is always stored locally and this is something I have control over.

Glad I use kiro-cli which doesn't do this.

Bishonen88 50 days ago

you might be biased due to your employment :)

Aperocky 50 days ago

Objectively speaking, I want control of context and when I compact it.

That wouldn't change with employment.

greatgib 49 days ago

In addition with the bug, a big part of the issue is that this change was done secretly by Anthropic and not communicated to the users.

If that was done, users could have been mindful of the change and figure out more easily that their problems could have come from that.