by lostmsu
511 days ago
This was a simplification. Of course you need some extra VRAM I/O for the KV cache, and that traffic scales with the cache size. But as long as your KV cache is much smaller than the model (KV cache size << model size), the simplification is pretty accurate. See, e.g., https://www.databricks.com/blog/llm-inference-performance-en... (just scroll to the first chart they have, which explains the idea).
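To put rough numbers on it, here is a back-of-the-envelope sketch. It assumes decoding is purely memory-bandwidth bound and that each generated token reads the full weights plus the full KV cache from VRAM once. The function name and all the figures (14 GB of weights, 0.5 GB of KV cache, 1000 GB/s of bandwidth) are made up for illustration, not measurements of any real GPU or model.

```python
def decode_tokens_per_second(model_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Rough upper bound on single-stream decode speed.

    Assumption: each generated token requires reading all model weights
    and the whole KV cache from VRAM exactly once, and nothing else
    (compute, kernel launch overhead) limits throughput.
    """
    bytes_read_per_token = model_bytes + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical example values, chosen only to illustrate the ratio.
MODEL_BYTES = 14e9      # e.g. 7B parameters stored as 2-byte floats
KV_CACHE_BYTES = 0.5e9  # assumed KV cache size, much smaller than the model
BANDWIDTH = 1000e9      # assumed 1000 GB/s of VRAM bandwidth

# The simplification: ignore the KV cache entirely.
simplified = decode_tokens_per_second(MODEL_BYTES, 0, BANDWIDTH)
# The fuller estimate: include KV cache reads.
with_kv = decode_tokens_per_second(MODEL_BYTES, KV_CACHE_BYTES, BANDWIDTH)

# How much the simplification overestimates, relative to the fuller estimate.
# Algebraically this equals KV_CACHE_BYTES / MODEL_BYTES.
overestimate = simplified / with_kv - 1

print(f"ignoring KV cache: {simplified:.1f} tokens/s")
print(f"including KV cache: {with_kv:.1f} tokens/s")
print(f"overestimate: {overestimate:.1%}")
```

Under these assumptions the error from ignoring the KV cache is just the ratio of KV cache size to model size, so it stays small whenever the cache is much smaller than the weights, and it grows as context length or batch size inflate the cache.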