My understanding is that this is what the KV cache does in model serving. I would imagine they'd want to prime any such KV cache with common tokens but retain a per-session cache to avoid leaks. HF seems to agree with the concept, at least: https://huggingface.co/docs/transformers/kv_cache#prefill-a-...
Fascinating, exactly what I was wondering about. Thank you! Turns out it's very sophisticated, and also explains why the current date is always at the very end of the system prompt.
https://platform.openai.com/docs/guides/prompt-caching
It's fairly simple actually. Each machine stores the KV cache in blocks of 128 tokens.
Those blocks are stored in a prefix-tree-like structure, probably with some sort of LRU eviction policy.
If you ask a machine to generate, it starts from the longest matching prefix in the cache, so only the tokens after that point have to be processed again.
They route between racks using a hash of the prefix.
Therefore the system prompt, being frequently used and at the beginning of the context, will almost always be in the prefix cache.
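
To make that concrete, here is a rough Python sketch of the scheme as described in this comment. It is a toy, not how any provider actually implements it: the class name PrefixCache, the helper route, the chained-hash keys (standing in for a real prefix tree), and the string placeholder for the KV tensors are all my own assumptions. Only the 128-token block size, the LRU guess, and the hash-of-prefix routing come from the thread.

```python
import hashlib
from collections import OrderedDict

BLOCK = 128  # tokens per cache block, as stated in the comment above


class PrefixCache:
    """Toy block-level prefix cache with LRU eviction (hypothetical design)."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        # key: hash of the entire prefix up to and including this block
        # value: placeholder standing in for that block's KV tensors
        self.blocks = OrderedDict()

    @staticmethod
    def _block_keys(tokens):
        """Yield one chained hash per full block; each key covers the whole prefix."""
        h = hashlib.sha256()
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            h.update(repr(tokens[end - BLOCK:end]).encode())
            yield h.hexdigest()

    def match(self, tokens):
        """Return how many leading tokens already have cached blocks."""
        hit = 0
        for key in self._block_keys(tokens):
            if key not in self.blocks:
                break  # the first miss ends the shared prefix
            self.blocks.move_to_end(key)  # mark as recently used
            hit += BLOCK
        return hit

    def insert(self, tokens):
        """Store every full block of `tokens`, evicting least recently used ones."""
        for key in self._block_keys(tokens):
            self.blocks[key] = "kv-placeholder"
            self.blocks.move_to_end(key)
            while len(self.blocks) > self.max_blocks:
                self.blocks.popitem(last=False)


def route(tokens, n_machines, prefix_len=BLOCK):
    """Pick a machine from a hash of the first `prefix_len` tokens (assumed scheme)."""
    digest = hashlib.sha256(repr(tokens[:prefix_len]).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_machines


# Pretend token ids: a 300-token system prompt followed by a per-request tail.
system_prompt = list(range(300))
first = system_prompt + [9001, 9002]   # e.g. today's date plus a question
second = system_prompt + [7, 8, 9]     # same system prompt, different tail

cache = PrefixCache(max_blocks=16)
cold_hit = cache.match(first)          # nothing cached yet
cache.insert(first)
warm_hit = cache.match(second)         # the two full system-prompt blocks are reused
changed_head_hit = cache.match([-1] + second)  # a change at the front misses everything
```

This also shows why a changing value like the date belongs at the end: the two requests differ only in the tail, so the full blocks covering the system prompt still match, while changing a single token at the front invalidates every block. Both requests also get the same result from route, since they share their first 128 tokens. One limitation of the toy: it evicts purely by recency and makes no attempt to drop leaf blocks before their parents.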