|
|
|
|
|
by rfoo
763 days ago
|
|
> it's just setting some values But it may very well be slower than just recompute it. At least for ordinary MHA and even GQA. So, either a model arch woodoo significantly reducing kv cache size (while keeping roughly the same compute cost), or some really careful implementation moving kv cache of upcoming requests to devices in background [0]. [0] My back of envelop calc shows that even then it still does not make sense for, say, Llama 3 70B on H100s. Time to stare at TPU spec harder trying to make sense of it I guess. |
|
You can also design an API optimized for batch workloads (say the same core prompt with different data for instruct-style reasoning) - that can result in large savings in those scenarios.