by giwook 61 days ago
Another way to think about it might be that caching is part of Anthropic's strategy to reduce costs for its users, but they are now trying to be more mindful of their costs (probably partly due to significant recent user growth as well as plans to IPO which demand fiscal prudence).

Perhaps if we were willing to pay more for our subscriptions, Anthropic would be able to offer longer cache windows. But IDK, one hour seems like a reasonable amount of time given the context, and it's a limitation I'm happy to work around (it's not that hard) to pay just $100 or $200 a month for the industry-leading LLM.

Full disclosure: I've recently signed up for ChatGPT Pro in addition to my Claude Max sub, so I'm not really biased one way or the other. I just want a quality LLM that's affordable.


I might be willing to pay more, maybe a lot more, for a subscription tier above Claude Max 20x, but the only thing higher is pay-per-token, and I really don't like products that make me be that minutely aware of my usage, especially when that usage is unpredictable. I think there's a reason most telecoms moved away from per-minute and especially per-MB charging. Even per-GB pricing is now often sold as a bundle of X GB, and I'm OK with that on a phone but much less so on a computer, because the size of a software update is unpredictable.

Kinda like when restaurants make me pay for ketchup or a takeaway box: I get annoyed. Just build it into the listed price.

For sure, I agree with that sentiment. It's interesting to consider the psychological component of that, like how "free shipping" is not really free, it's oftentimes just packaged into the price of the product but somehow it feels like we're getting a better deal.

I would not be surprised to see Anthropic, OpenAI etc head in the direction you mention as they mature and all of these datacenters currently undergoing construction come online in the next few years and drive down costs.

Token anxiety is real mental overhead.
That's the phrase I was looking for, thank you.
It doesn’t make sense to pay more for cache warming. Your session is, for the most part, already persisted. Why would it be reasonable to pay again to continue where you left off at some point in the future?
Because it significantly increases actual costs for Anthropic.

If they ignored this then all users who don’t do this much would have to subsidize the people who do.

I’m coming at this as a complete Claude amateur, but caching for any other service is an optimisation for the company and transparent for the user. I don’t think I’ve ever used a service and thought “oh there’s a cache miss. Gotta be careful”.

I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.

That is because LLM KV caching is not like the caches you are used to (see my other comments, but it's tens of GB per request and involves internal LLM state that must live on or be moved onto a GPU, and much of the cost is in moving all that data around). It cannot be made transparent to the user because the bandwidth costs are too large a fraction of Anthropic's unit economics to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users who use it inefficiently, and I don't think that's a good solution at all. I'd much rather this be surfaced to users, as it is with all commercial LLM APIs.
Think of it like this: Anthropic has to keep a full virtual machine running just for you. How long should it idle there taking resources when you only pay a static monthly fee and not hourly?

They have a limited number of resources and can’t keep everyone’s VM running forever.

I pay $5/mo to Vultr for a VM that runs continuously and maintains 25GB of state.
That price at Vultr gets you 1GB of RAM, and 25GB of relatively slow SSD.

The KV cache of your Claude context is:

- Potentially much larger than 25GB. (The KV cache sizes you see people quoting for local models are for smaller models.)

- While it's being used, it's all in RAM.

- Actually it's held in special high-performance GPU RAM, precision-bonded directly to the silicon of ludicrously expensive, state of the art GPUs.

- The KV state memory has to be many thousands of times faster than your 25GB state.

- It's much more expensive per GB than the CPU memory used by a VM. And that in turn is much more expensive than the SSD storage of your 25GB.

- Because Claude is used by far more people (and their agents) than rent VMs, far more people are competing to use that expensive memory at the same time

There is a lot going on to move KV cache state between GPU memory and dedicated, cheaper storage, on demand as different users need different state. But the KV cache data is so large, and used in its entirety when the context is active, that moving it around is expensive too.

Now check out the cost difference in 25GB of computer RAM vs GPU RAM.

And yes, this is also why computer RAM has jumped the shark in costs.

In total data transferred per hour, your server and the workloads LLMs are running aren't even within five orders of magnitude of each other. And this is why the compute and power markets are totally screwed.

It does not. It just has a fast way to give you the illusion it "runs continuously" with 25GB of warm memory.

Tbh, I'm not sure paged VRAM could solve this problem for a system with (I assume) a very high cache-miss rate, such as a major LLM server.

Genuine question: is the cost to keep a persistent warmed cache for sessions idling for hours/days not significant when done for hundreds of thousands of users? Wouldn’t it pose a resource constraint on Anthropic at some point?
Related question, is it at all feasible to store cache locally to offload memory costs and then send it over the wire when needed?
No, the cache is a few GB large for most usual context sizes. It depends on model architecture, but if you take Gemma 4 31B at 256K context length, it takes 11.6GB of cache

Note: I picked the values from a blog and they may be inaccurate, but in pretty much all models the KV cache is very large, and it's probably even larger in Claude.
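
For anyone who wants to sanity-check numbers like this, here is a rough back-of-the-envelope sketch of how KV cache size scales with context length in a plain transformer. The layer count, KV head count, and head dimension below are made-up illustrative values, not the real configuration of Gemma, Claude, or any other model, and real models often shrink the cache with tricks like sliding-window attention or quantized caches.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size for one sequence in a standard transformer.

    The leading factor of 2 covers the separate key and value tensors
    stored per layer. bytes_per_value=2 assumes 16-bit values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# HYPOTHETICAL configuration, chosen only to show the scaling.
size = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128,
                      context_tokens=256 * 1024)
print(f"{size / 2**30:.1f} GiB")  # prints "48.0 GiB"
```

The size grows linearly in every factor, so doubling the context length doubles the cache.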

To extend your point: it's not really the storage costs of the size of the cache that's the issue (server-side SSD storage of a few GB isn't expensive), it's the fact that all that data must be moved quickly onto a GPU in a system in which the main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost was keeping a few 10s of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.
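
To put rough numbers on the data-movement point, a minimal sketch: the time to reload a cache is roughly its size divided by the bandwidth of the link it crosses. The cache size and every bandwidth figure below are assumed round numbers for illustration, not measurements of Anthropic's or anyone else's hardware.

```python
def transfer_seconds(cache_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to move a cache across one link (ignores latency and contention)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return cache_gb / bandwidth_gb_per_s


# HYPOTHETICAL cache size and link speeds, chosen only to show the scaling.
CACHE_GB = 20.0
ASSUMED_LINKS_GB_PER_S = {
    "remote storage": 2.0,
    "host RAM to GPU": 25.0,
    "GPU memory": 2000.0,
}

for link, bandwidth in ASSUMED_LINKS_GB_PER_S.items():
    print(f"{link}: {transfer_seconds(CACHE_GB, bandwidth):.3f} s")
```

What matters is the ratio between the rows rather than the absolute values: while that data is in flight, the link is not doing work for any other request.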
That cost that you're talking about doesn't change based on how long the session is idle. No matter what happens, they're storing that state and bringing it back at some point; the only difference is how long it's stored off the GPU between requests.
Yesterday I was playing around with Gemma4 26B A4B with a 3 bit quant and sizing it for my 16GB 9070XT:

  Total VRAM: 16GB
  Model: ~12GB
  128k context size: ~3.9GB
At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models).
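
Spelling out the arithmetic on the figures quoted above as a small sketch (the three inputs are the approximate numbers from this comment, and the variable names are just for illustration):

```python
# Approximate figures quoted in the comment above.
total_vram_gb = 16.0
model_gb = 12.0
context_gb = 3.9

# Whatever is left after weights and context has to cover everything else.
headroom_gb = total_vram_gb - model_gb - context_gb
context_share = context_gb / total_vram_gb

print(f"headroom: {headroom_gb:.1f} GB")
print(f"context share of total VRAM: {context_share:.0%}")
```

So the context alone takes roughly a quarter of the card, and almost nothing is left over.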
Sure, it wouldn’t make sense if they only had one customer to serve :)
Exactly, even in the throes of today's wacky economic tides, storage is still cheap. Write the model state immediately after the N context messages in cache to disk and reload without extra inference on the context tokens themselves. If every customer did this for ~3 conversations per user you still would only need a small fraction of a typical datacenter to house the drives necessary. The bottleneck becomes architecture/topology and the speed of your buses, which are problems that have been contended with for decades now, not inference time on GPUs.
This has nothing to do with the cost of storage. Surprisingly, you are not better informed than Anthropic on the subject of serving AI inference models.

A sibling comment explains:

https://news.ycombinator.com/item?id=47886200

They don't cache model state to disk. I am proposing they do.
I’m proposing that you should educate yourself on the subject of LLM KV context caching.
It may be persisted but it is not live in the inference engine.
The reason I've been querying the 1 hour is that a user's quota reset often takes longer than that. As a result, I've seen situations where someone builds a large context, hits their quota limit, and waits 2+ hours, by which point their cache is gone. Their first message then eats 20%+ of their current session quota, and the user doesn't want to compact because they're still trying to get the model to a good understanding of the problem. This is a really painful consequence for users on anything less than a Max plan, and it looks like an unintended consequence of Anthropic's own system design choices.

I.e., the way their quota and caching interact doesn't make Pro and Max a little different, it makes them significantly different by unintentionally penalising Pro users.
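
A rough sketch of why that first message after an expired cache hurts so much. It assumes cached input is billed at a fraction of the base input rate and cache writes at a small premium; the two multipliers below are placeholder assumptions, and how subscription quota actually maps onto tokens is also an assumption here, so treat the output as illustrative only.

```python
def billed_input_units(context_tokens, new_tokens, cache_hit,
                       read_multiplier=0.1, write_multiplier=1.25):
    """Input cost of one request, in units of 'base-rate input tokens'.

    ASSUMPTION: cached tokens cost read_multiplier x the base rate and
    tokens written to the cache cost write_multiplier x. Both defaults are
    placeholders; check current pricing before relying on them.
    """
    if cache_hit:
        return context_tokens * read_multiplier + new_tokens * write_multiplier
    # Cache expired: the whole context is processed and written again.
    return (context_tokens + new_tokens) * write_multiplier


# HYPOTHETICAL session: a 150k-token context plus a 500-token follow-up.
warm = billed_input_units(150_000, 500, cache_hit=True)
cold = billed_input_units(150_000, 500, cache_hit=False)
print(f"warm: {warm:.0f} units, cold: {cold:.0f} units, ratio: {cold / warm:.1f}x")
```

Under these assumptions, the same follow-up message costs an order of magnitude more against the quota once the cache has expired, which matches the pattern described above.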