by menaerus
515 days ago
That's not how it works. If what you're saying were true, the memory complexity of self-attention would be O(1), i.e. independent of the batch size. This obviously isn't the case, since each batch computation requires its own load/store memory bandwidth. I suggest reading one of the transformer papers to really understand how it works.
|
But assuming your KV cache size is << model size, that simplification is pretty accurate.
See, e.g. https://www.databricks.com/blog/llm-inference-performance-en...
You can just scroll to the first chart they have that explains the idea.
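
To put rough numbers on both points above, here is a back-of-the-envelope sketch. It assumes a hypothetical 7B-parameter model in fp16 with 32 layers, 32 KV heads, head dimension 128, and a 2048-token context; none of these figures come from the linked post, and the helper names `kv_cache_bytes` and `weight_bytes` are made up for illustration.

```python
# Back-of-the-envelope memory footprint for one decode step.
# All model dimensions below are illustrative assumptions (a generic
# 7B-class model in fp16), not figures taken from the linked post.

def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes held in the KV cache: one K and one V vector per token, per layer."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def weight_bytes(n_params, bytes_per_elem=2):
    """Bytes needed to read every model weight once."""
    return n_params * bytes_per_elem


if __name__ == "__main__":
    n_params = 7_000_000_000  # assumed parameter count
    cfg = dict(seq_len=2048, n_layers=32, n_kv_heads=32, head_dim=128)  # assumed shape
    w = weight_bytes(n_params)
    for batch in (1, 8, 64):
        kv = kv_cache_bytes(batch, **cfg)
        # The weights are shared by the whole batch; the KV cache is not.
        print(f"batch={batch:3d}  weights={w / 1e9:.1f} GB  "
              f"kv_cache={kv / 1e9:.1f} GB  kv/weights={kv / w:.2f}")
```

Under these assumptions the KV cache is about 1 GB at batch size 1 against 14 GB of weights, so treating the weights as the dominant cost is reasonable there. The cache grows linearly with batch size, though, and at batch size 64 it is roughly 69 GB, well past the weights, which is where the simplification stops holding.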