| HN Mirror

Saying that it is just in index from string prefixes into KV Cache misses all the fun, interesting, and complicated parts of it. While technically the size of the prompt-pointers is tiny compared with the data it points into, the massive scale of managing this over all users and requests and routing inside the compute cluster makes it an expensive thing to implement and tune. Also, keeping the prompt cache sufficiently responsive and storing the large KV Caches somewhere costs a lot as well in resources.

I think that the OpenAI docs are pretty useful for the API level understanding of how it can work (https://developers.openai.com/api/docs/guides/prompt-caching...). The vLLM docs (https://docs.vllm.ai/en/stable/design/prefix_caching/) and SGLang radix hashing (https://lmsys.org/blog/2024-01-17-sglang/) are useful for insights into how to implement it locally for one computer ode.