| HN Mirror



	by mdaniel 355 days ago
	My understanding is that this is what the KV cache does in model serving. I would imagine they'd want to prime any such KV cache with common tokens but retain a per-session cache to avoid leaks. It seems HF agrees with the concept, at least: https://huggingface.co/docs/transformers/kv_cache#prefill-a-...


kingstnap 353 days ago

OpenAI has docs about how it works.

https://platform.openai.com/docs/guides/prompt-caching

It's fairly simple actually. Each machine stores the KV cache in blocks of 128 tokens.

Those blocks are stored in a prefix-tree-like structure, probably with some sort of LRU eviction policy.

If you ask a machine to generate, it starts from the longest matching sequence in its cache and only computes the tokens after that.

They route requests between machines using a hash of the prompt's prefix, so requests that share a prefix tend to land on the same cache.

Therefore the system prompt, being frequently used and at the beginning of the context, will almost always be in the prefix cache.
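
To make the pieces above concrete, here is a minimal Python sketch of the idea as I understand it, not OpenAI's actual implementation. The class and function names (`PrefixBlockCache`, `route`) are made up for illustration, the KV tensors are replaced by a placeholder string, the 256-token routing window is an assumption, and a hash of each whole prefix stands in for a real prefix tree.

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 128  # tokens per cached block, as described in the comment


class PrefixBlockCache:
    """Toy per-machine prefix cache with LRU eviction (hypothetical design)."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()  # prefix hash -> placeholder for KV tensors

    @staticmethod
    def _block_keys(tokens):
        # One key per full block. Each key hashes the entire prefix up to that
        # block boundary, so a block can only match if everything before it matches.
        return [
            hashlib.sha256(repr(tuple(tokens[:end])).encode()).hexdigest()
            for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE)
        ]

    def longest_cached_prefix(self, tokens):
        """Return the number of leading tokens whose KV blocks are cached."""
        hit = []
        for key in self._block_keys(tokens):
            if key not in self.blocks:
                break
            hit.append(key)
        # Refresh deepest blocks first so the shared root ends up most recent.
        for key in reversed(hit):
            self.blocks.move_to_end(key)
        return len(hit) * BLOCK_SIZE

    def insert(self, tokens):
        """Cache every full block of `tokens`, then evict least recently used."""
        for key in reversed(self._block_keys(tokens)):
            self.blocks[key] = "kv-placeholder"
            self.blocks.move_to_end(key)
        while len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)


def route(tokens, num_machines, prefix_len=256):
    """Pick a machine index from a hash of the first `prefix_len` tokens."""
    digest = hashlib.sha256(repr(tuple(tokens[:prefix_len])).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


system_prompt = list(range(300))  # stand-in token IDs
cache = PrefixBlockCache(max_blocks=8)
cache.insert(system_prompt + [1000, 1001])
print(cache.longest_cached_prefix(system_prompt + [2000] * 200))  # 256
```

Only whole blocks are reusable, so a 300-token shared prompt yields 256 cached tokens here, and two requests with the same first 256 tokens hash to the same machine.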

crazygringo 353 days ago

Fascinating, exactly what I was wondering about. Thank you! Turns out it's very sophisticated, and also explains why the current date is always at the very end of the system prompt.
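
A quick way to see why the date placement matters, as a rough Python sketch under the 128-token block assumption from the comment above. The function name `reusable_tokens`, the stand-in tokens, and the dates are all hypothetical, and the sketch ignores any minimum prompt length a real service may impose.

```python
BLOCK_SIZE = 128  # assumed block size, taken from the comment above


def reusable_tokens(previous, current, block_size=BLOCK_SIZE):
    """Count leading tokens of `current` that could be served from blocks cached for `previous`."""
    common = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        common += 1
    # Only whole blocks can be reused, so round down to a block boundary.
    return (common // block_size) * block_size


static_part = ["tok%d" % i for i in range(1000)]  # unchanging system prompt text

# Date at the start: the very first token changes each day, so nothing matches.
print(reusable_tokens(["2025-01-06"] + static_part,
                      ["2025-01-07"] + static_part))  # 0

# Date at the end: the whole static part is shared, minus the partial last block.
print(reusable_tokens(static_part + ["2025-01-06"],
                      static_part + ["2025-01-07"]))  # 896
```

Anything that changes per day or per request invalidates every block after it, so putting it last keeps the shared prefix as long as possible.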