Hacker News new | ask | show | jobs
by GodelNumbering 12 days ago
Another neat thing is, they publish hourly caching states for ALL model/provider combinations. I did some research on it to come up with a provider tiers list and found a bunch of open-source 3rd party hosts are simply trash tier https://dirac.run/posts/cache-hit-rates-agents
3 comments

I would recommend tracking this data over time. I work on Cloudflare's KV cache for Kimi K2.6, and while there are periods where our cache rate is low, we are frequently in the 80-90% range. OpenRouter shows us at 87.3% at the time of this post. We observe cache rates change quite a bit from hour to hour.
True for Kimi, but the results I published are average across the models (CF has over 10 models on openrouter). Your current Kimi K2.6 is over 80% but Gemma 4 26B A4B is 0%. https://openrouter.ai/google/gemma-4-26b-a4b-it

This is also the reason providers like Anthropic scored lower because while Opus 4.7 is close to 90%, Opus 4.5 is 45%

My point was not about our ranking specifically, but the methodology of taking a point-in-time sample.
Thank you so much for this! I've been working on exactly this problem this week (which OpenRouter providers have the highest cache rate on average) because cache cost is sometimes half your cost: I'd much rather use a provider with more input caching with a more expensive/better LLM. Your results and lists seem more comprehensive than what I've done so far. Very helpful!
Agents push the full conversation history into context every turn

Why?

Maybe this is a dumb question, but why wouldn't an agent "keep the conversation going", like I do when interacting with an LLM through a web page? (I understand how it's impractical for long-running tasks where the agent has to wait days for the next input, but assume that's not the majority of use cases)

I’m not sure I understand your question. Every interaction you have with a model in a web page does the same thing in the backend. It feeds the whole conversation history, perhaps with a bit of processing, into the model so it can process the next generation. Filling the context window is how these models retain coherence.
LLMs are stateless, to predict next tokens they need the history. When you write your own agents you will be very selective and might trim context and heavily segment different tasks, but generic ones don't do that (at best they spawn subjects to handle smaller tasks)
That said, the KV cache is very much not stateless, so internally inference APIs will be highly incentivized to route requests to instances with as much a shared prefix cached as possible.
Thanks. If I ran it local, presumably I could keep the state cached forever. Can you "reserve" resources from a frontier provider to guarantee your state stays "hot"? (Analogous to reserving a whole VM instead of a slice)
For Anthropic, 5 minute caching costs 1.25x base input price and 1 hour costs 2x base input price. https://platform.claude.com/docs/en/about-claude/pricing#pro...

For OpenAI, it seems like you can't prolong the caching duration for money. Duration is longer during off-peak hours for in-memory caching and up to 24 hours for extended prompt caching. https://developers.openai.com/api/docs/guides/prompt-caching

For DeepSeek, caching duration of at least 12 hours (and likely longer) have been observed. Cache writes are free. https://zhuanlan.zhihu.com/p/2035737726952194774

BTW, the openai responses api has a store parameter and a thread id input. Makes it possible to send a thread id and append a new message, ask for completion. So it feels like keeping the conversation going.

Technically it does retrieve the entire history and reevaulate it since the LLM is stateless. Just more ergonomic for the developer.

And prompt caching helps cut the costs down when a conversation drags on.

Wow, this is refreshing DX compared to iterating all messages like we did back in '24.
I would disagree. Having all the messages locally and sending them with the request means you can switch inference providers or even models mid-conversation. It also means that the provider doesn't store the entire context, which often contains massive parts of proprietary codebases, secrets and PII and instead the agent harness manages all that. While a simple `continue thread` API field might seem more convenient, the cost is still determined by the input token count and cache rate, so it just abstracts this crucial implementation detail away.
The "web page" does the same you just don't see it.