Hacker News new | ask | show | jobs
by kingstnap 307 days ago
OpenAI has docs about how it works.

https://platform.openai.com/docs/guides/prompt-caching

It's fairly simple actually. Each machine stores the KV cache in blocks of 128 tokens.

That's stored in a prefix tree like structure. Probably with some sort of LRU eviction policy.

If you ask a machine to generate it does so starting from the longest matching sequence in the cache.

They route between racks using a hash of the prefix.

Therefore the system prompt, being frequently used and at the beginning of the context, will always be in the prefix cache.

1 comments

Fascinating, exactly what I was wondering about. Thank you! Turns out it's very sophisticated, and also explains why the current date is always at the very end of the system prompt.