| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by crazygringo 354 days ago

How does a prompt this long affect resource usage?

Does inference need to process this whole thing from scratch at the start of every chat?

Or is there some way to cache the state of the LLM after processing this prompt, before the first user token is received, and every request starts from this cached state?

1 comments

mdaniel 354 days ago

My understanding is that is what the KV cache does in models serving. I would imagine they'd want to prime any such KV cache with common tokens but retain a per-session cache to avoid leaks. It seems HF agrees with the concept, at least https://huggingface.co/docs/transformers/kv_cache#prefill-a-...

link

kingstnap 353 days ago

OpenAI has docs about how it works.

https://platform.openai.com/docs/guides/prompt-caching

It's fairly simple actually. Each machine stores the KV cache in blocks of 128 tokens.

That's stored in a prefix tree like structure. Probably with some sort of LRU eviction policy.

If you ask a machine to generate it does so starting from the longest matching sequence in the cache.

They route between racks using a hash of the prefix.

Therefore the system prompt, being frequently used and at the beginning of the context, will always be in the prefix cache.

link

crazygringo 352 days ago

Fascinating, exactly what I was wondering about. Thank you! Turns out it's very sophisticated, and also explains why the current date is always at the very end of the system prompt.

link