| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by isbvhodnvemrwvn 18 days ago
	LLMs are stateless, to predict next tokens they need the history. When you write your own agents you will be very selective and might trim context and heavily segment different tasks, but generic ones don't do that (at best they spawn subjects to handle smaller tasks)

2 comments

lxgr 18 days ago

That said, the KV cache is very much not stateless, so internally inference APIs will be highly incentivized to route requests to instances with as much a shared prefix cached as possible.

link

rkagerer 17 days ago

Thanks. If I ran it local, presumably I could keep the state cached forever. Can you "reserve" resources from a frontier provider to guarantee your state stays "hot"? (Analogous to reserving a whole VM instead of a slice)

link

gpugreg 17 days ago

For Anthropic, 5 minute caching costs 1.25x base input price and 1 hour costs 2x base input price. https://platform.claude.com/docs/en/about-claude/pricing#pro...

For OpenAI, it seems like you can't prolong the caching duration for money. Duration is longer during off-peak hours for in-memory caching and up to 24 hours for extended prompt caching. https://developers.openai.com/api/docs/guides/prompt-caching

For DeepSeek, caching duration of at least 12 hours (and likely longer) have been observed. Cache writes are free. https://zhuanlan.zhihu.com/p/2035737726952194774

link