| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by libraryofbabel 54 days ago
	They are caching internal LLM state, which is in the 10s of GB for each session. It's called a KV cache (because the internal state that is cached are the K and V matrices) and it is fundamental to how LLM inference works; it's not some Anthropic-specific design decision. See my other comment for more detail and a reference.