| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by GodelNumbering 12 days ago
	Another neat thing is, they publish hourly caching states for ALL model/provider combinations. I did some research on it to come up with a provider tiers list and found a bunch of open-source 3rd party hosts are simply trash tier https://dirac.run/posts/cache-hit-rates-agents

3 comments

kflansburg 12 days ago

I would recommend tracking this data over time. I work on Cloudflare's KV cache for Kimi K2.6, and while there are periods where our cache rate is low, we are frequently in the 80-90% range. OpenRouter shows us at 87.3% at the time of this post. We observe cache rates change quite a bit from hour to hour.

link

GodelNumbering 12 days ago

True for Kimi, but the results I published are average across the models (CF has over 10 models on openrouter). Your current Kimi K2.6 is over 80% but Gemma 4 26B A4B is 0%. https://openrouter.ai/google/gemma-4-26b-a4b-it

This is also the reason providers like Anthropic scored lower because while Opus 4.7 is close to 90%, Opus 4.5 is 45%

link

kflansburg 12 days ago

My point was not about our ranking specifically, but the methodology of taking a point-in-time sample.

link

gnulinux 12 days ago

Thank you so much for this! I've been working on exactly this problem this week (which OpenRouter providers have the highest cache rate on average) because cache cost is sometimes half your cost: I'd much rather use a provider with more input caching with a more expensive/better LLM. Your results and lists seem more comprehensive than what I've done so far. Very helpful!

link

rkagerer 12 days ago

Agents push the full conversation history into context every turn

Why?

Maybe this is a dumb question, but why wouldn't an agent "keep the conversation going", like I do when interacting with an LLM through a web page? (I understand how it's impractical for long-running tasks where the agent has to wait days for the next input, but assume that's not the majority of use cases)

link

sosodev 12 days ago

I’m not sure I understand your question. Every interaction you have with a model in a web page does the same thing in the backend. It feeds the whole conversation history, perhaps with a bit of processing, into the model so it can process the next generation. Filling the context window is how these models retain coherence.

link

isbvhodnvemrwvn 12 days ago

LLMs are stateless, to predict next tokens they need the history. When you write your own agents you will be very selective and might trim context and heavily segment different tasks, but generic ones don't do that (at best they spawn subjects to handle smaller tasks)

link

lxgr 12 days ago

That said, the KV cache is very much not stateless, so internally inference APIs will be highly incentivized to route requests to instances with as much a shared prefix cached as possible.

link

rkagerer 11 days ago

Thanks. If I ran it local, presumably I could keep the state cached forever. Can you "reserve" resources from a frontier provider to guarantee your state stays "hot"? (Analogous to reserving a whole VM instead of a slice)

link

gpugreg 11 days ago

For Anthropic, 5 minute caching costs 1.25x base input price and 1 hour costs 2x base input price. https://platform.claude.com/docs/en/about-claude/pricing#pro...

For OpenAI, it seems like you can't prolong the caching duration for money. Duration is longer during off-peak hours for in-memory caching and up to 24 hours for extended prompt caching. https://developers.openai.com/api/docs/guides/prompt-caching

For DeepSeek, caching duration of at least 12 hours (and likely longer) have been observed. Cache writes are free. https://zhuanlan.zhihu.com/p/2035737726952194774

link

eknkc 12 days ago

BTW, the openai responses api has a store parameter and a thread id input. Makes it possible to send a thread id and append a new message, ask for completion. So it feels like keeping the conversation going.

Technically it does retrieve the entire history and reevaulate it since the LLM is stateless. Just more ergonomic for the developer.

And prompt caching helps cut the costs down when a conversation drags on.

link

drewnick 12 days ago

Wow, this is refreshing DX compared to iterating all messages like we did back in '24.

link

ghrl 11 days ago

I would disagree. Having all the messages locally and sending them with the request means you can switch inference providers or even models mid-conversation. It also means that the provider doesn't store the entire context, which often contains massive parts of proprietary codebases, secrets and PII and instead the agent harness manages all that. While a simple `continue thread` API field might seem more convenient, the cost is still determined by the input token count and cache rate, so it just abstracts this crucial implementation detail away.

link

BoredPositron 12 days ago

The "web page" does the same you just don't see it.

link