Hacker News new | ask | show | jobs
by isbvhodnvemrwvn 18 days ago
LLMs are stateless, to predict next tokens they need the history. When you write your own agents you will be very selective and might trim context and heavily segment different tasks, but generic ones don't do that (at best they spawn subjects to handle smaller tasks)
2 comments

That said, the KV cache is very much not stateless, so internally inference APIs will be highly incentivized to route requests to instances with as much a shared prefix cached as possible.
Thanks. If I ran it local, presumably I could keep the state cached forever. Can you "reserve" resources from a frontier provider to guarantee your state stays "hot"? (Analogous to reserving a whole VM instead of a slice)
For Anthropic, 5 minute caching costs 1.25x base input price and 1 hour costs 2x base input price. https://platform.claude.com/docs/en/about-claude/pricing#pro...

For OpenAI, it seems like you can't prolong the caching duration for money. Duration is longer during off-peak hours for in-memory caching and up to 24 hours for extended prompt caching. https://developers.openai.com/api/docs/guides/prompt-caching

For DeepSeek, caching duration of at least 12 hours (and likely longer) have been observed. Cache writes are free. https://zhuanlan.zhihu.com/p/2035737726952194774