by mmoskal 453 days ago
Just to clarify: simple prefix KV caching doesn't require any special model training. It does require support from the inference framework, but most have it by now.

You can see dramatic improvements in latency and throughput when the queries share a large prefix.
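
To make the saving concrete, here is a toy sketch of the bookkeeping (not any real framework's implementation). It assumes a hypothetical `ToyPrefixCache` class that only counts how many token positions would need fresh KV computation. Real engines typically cache fixed-size blocks or use a tree rather than storing every prefix, and the token IDs and sizes below are made up for illustration.

```python
from typing import List, Set, Tuple


class ToyPrefixCache:
    """Hypothetical stand-in for a prefix KV cache.

    It records which token prefixes have already been "computed" and
    counts how many token positions needed fresh computation.
    """

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()
        self.tokens_computed = 0

    def prefill(self, tokens: List[int]) -> int:
        """Process one query; return how many leading tokens were reused."""
        # Find the longest prefix of this query that is already cached.
        reused = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._seen:
                reused = n
                break
        # "Compute" the remaining positions and remember each new prefix.
        for n in range(reused + 1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:n]))
            self.tokens_computed += 1
        return reused


# Assumed workload: a 1000-token shared prefix (e.g. a system prompt)
# followed by three different 10-token suffixes.
shared_prefix = list(range(1000))
queries = [
    shared_prefix + [10_000 + 10 * i + j for j in range(10)]
    for i in range(3)
]

cache = ToyPrefixCache()
reused = [cache.prefill(q) for q in queries]
tokens_without_cache = sum(len(q) for q in queries)

print("reused per query:", reused)
print("computed with cache:", cache.tokens_computed)
print("computed without cache:", tokens_without_cache)
```

In this made-up workload only the first query pays for the shared prefix; the later ones compute just their own suffix. Note that nothing here touches the model itself, which is why no special training is involved.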