by kingstnap
307 days ago
OpenAI has docs on how it works: https://platform.openai.com/docs/guides/prompt-caching It's fairly simple, actually. Each machine stores the KV cache in blocks of 128 tokens, kept in a prefix-tree-like structure, probably with some sort of LRU eviction policy. When you ask a machine to generate, it starts from the longest matching sequence in its cache. Requests are routed between racks using a hash of the prefix. So the system prompt, being frequently used and at the beginning of the context, will almost always be in the prefix cache.
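For anyone who wants to see the idea concretely, here is a rough toy sketch of block-based prefix caching with hash routing. It is not OpenAI's implementation. `PrefixCache`, `block_keys`, and `route` are names I made up, the LRU policy is the guess from the comment above, and the 256-token routing prefix is an assumed parameter. A flat dict keyed by chained hashes stands in for the prefix tree, since each key identifies the whole path of blocks leading up to it.

```python
import hashlib
from collections import OrderedDict

BLOCK = 128  # tokens per cache block, as described in the comment


def block_keys(tokens, block=BLOCK):
    """One key per full block; each key hashes every token up to that block."""
    keys = []
    h = hashlib.sha256()
    for end in range(block, len(tokens) + 1, block):
        h.update(repr(tokens[end - block:end]).encode())
        keys.append(h.hexdigest())  # chained, so key i depends on blocks 0..i
    return keys


class PrefixCache:
    """Hypothetical per-machine cache; values stand in for real KV tensors."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # least recently used entry comes first

    def longest_prefix(self, tokens):
        """Return how many leading tokens are already covered by cached blocks."""
        matched = 0
        for key in block_keys(tokens):
            if key not in self.blocks:
                break
            self.blocks.move_to_end(key)  # mark as recently used
            matched += BLOCK
        return matched

    def insert(self, tokens):
        """Store every full block of this prompt, evicting LRU blocks if full."""
        for key in block_keys(tokens):
            self.blocks[key] = True
            self.blocks.move_to_end(key)
            while len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)


def route(tokens, n_machines, prefix_len=256):
    """Pick a machine from a hash of the first prefix_len tokens (assumed value)."""
    digest = hashlib.sha256(repr(tokens[:prefix_len]).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_machines


system_prompt = list(range(300))       # stand-in token ids
request_a = system_prompt + [1000, 1001]
request_b = system_prompt + [2000]

cache = PrefixCache(capacity_blocks=16)
cache.insert(request_a)
print(cache.longest_prefix(request_b))  # tokens reusable from the cache
print(route(request_a, 8) == route(request_b, 8))
```

Two requests that share a system prompt hash to the same machine and reuse its full 128-token blocks; only the tail after the last full block has to be recomputed. A real system would evict leaf blocks before their parents, which this flat LRU does not guarantee.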