by rfoo
850 days ago
I'm not talking about the so-called quadratic memory requirement of the attention step; there NEVER WAS ONE. I'm talking about a simple fact: to run LLM inference cost-efficiently you have to have a KV "cache", and its size grows linearly with both your expected batch size and your context window length. With a large context window it becomes even bigger than the model weights. I don't want to be mean, but sorry: read up on PagedAttention. You clearly don't know what you are talking about; please be better.
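To put rough numbers on that, here is a back-of-the-envelope sketch. The function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128, fp16, about 7B parameters) are illustrative assumptions of mine, not measurements from any particular serving stack, and real systems add allocator overhead on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper, not a library API).

    Each token stores one key and one value vector per layer per KV head,
    hence the leading factor of 2.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed shape: roughly a 7B-parameter decoder in fp16 (illustrative only).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128
WEIGHT_BYTES = 7_000_000_000 * 2  # ~7B params at 2 bytes each

cache = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM,
                       seq_len=4096, batch_size=32)

print(f"KV cache: {cache / 2**30:.1f} GiB")
print(f"Weights:  {WEIGHT_BYTES / 2**30:.1f} GiB")
print("Cache larger than weights:", cache > WEIGHT_BYTES)
```

Under these assumptions the cache costs 0.5 MiB per token, so a batch of 32 sequences at 4096 tokens each needs 64 GiB, several times the weights. Doubling either the batch size or the context length doubles it; nothing here is quadratic.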