by menaerus
483 days ago
Recomputing everything at every step is the worst-case scenario, which is why I included it in the example (1k tokens). In that case the KV-cache is obviously set to 0, but it is just as obvious that this is a much worse alternative than using the KV-cache, which is pretty much the reason we have the KV-cache in the first place. Therefore the argument about loading the cached tensors doesn't make a difference at all.
> It's going to take me some minutes to find out what's wrong in this napkin math.
I am sure you will. Please don't be so entitled.
|
Sorry, what? Who the fuck in this world runs decode without a k/v cache??! If you run without a k/v cache you are basically doing prefill for every token you generate, and that's not what we call "decode". That's what we call "prefill".
The k/v cache, while named a "cache", is a lot more important than what people usually think of as a "cache". It's an essential part of the algorithm. If you lose your k/v cache you must run prefill again. If you run prefill for every token you generate, the total attention cost is not O(n^2), it's O(n^3).
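To make the O(n^2) versus O(n^3) claim concrete, here is a rough counting sketch. It only counts query-key dot products in causal attention for a single head and a single layer, assumes generation starts from an empty prompt, and ignores projections and MLP cost. The function names are made up for illustration.

```python
def attention_scores_without_cache(n: int) -> int:
    """Query-key dot products when every step re-runs causal attention from scratch."""
    total = 0
    for t in range(1, n + 1):
        # Full causal attention over t tokens: token i is scored against i keys.
        total += t * (t + 1) // 2
    return total


def attention_scores_with_cache(n: int) -> int:
    """Query-key dot products when keys/values of earlier tokens are cached."""
    total = 0
    for t in range(1, n + 1):
        # Only the newest token's query is scored against the t keys seen so far.
        total += t
    return total


for n in (10, 100, 1000):
    print(n, attention_scores_without_cache(n), attention_scores_with_cache(n))
```

Under these assumptions the uncached count is n(n+1)(n+2)/6 and the cached count is n(n+1)/2, so at n = 1000 the gap is already a few hundred times.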
And yeah, you can run prefill 1000 times to generate a 1000-token output. Or you can run prefill once and, with the persisted k/v cache, run decode 1000 times. There are tradeoffs to be made here, but it simply makes no sense to drop the k/v cache in the middle of generating a response: as your numbers show, recomputing is guaranteed to be slower than loading the k/v cache.
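A toy sketch of "prefill once, then decode with the cache" versus "prefill again for every token", in pure Python. It is a single attention head with random weights, not any real model or library API. `prefill`, `decode_step`, and the dimension `D` are hypothetical names chosen for illustration.

```python
import math
import random

random.seed(0)
D = 4  # hypothetical head dimension


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def project(vec, w):
    """Row vector times matrix."""
    return [sum(vec[i] * w[i][j] for i in range(len(vec))) for j in range(len(w[0]))]


WQ, WK, WV = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(D)]


def prefill(tokens):
    """Process the whole sequence; return per-token outputs and the K/V cache."""
    keys = [project(x, WK) for x in tokens]
    values = [project(x, WV) for x in tokens]
    outputs = [attend(project(x, WQ), keys[: i + 1], values[: i + 1])
               for i, x in enumerate(tokens)]
    return outputs, (keys, values)


def decode_step(x_new, cache):
    """Process one new token, appending its key/value to the cache."""
    keys, values = cache
    keys = keys + [project(x_new, WK)]
    values = values + [project(x_new, WV)]
    return attend(project(x_new, WQ), keys, values), (keys, values)


tokens = rand_matrix(6, D)
_, cache = prefill(tokens[:5])                      # prefill once on the first 5 tokens
out_cached, cache = decode_step(tokens[5], cache)   # one decode step for token 6
out_full, _ = prefill(tokens)                       # alternative: prefill all 6 again
print(all(math.isclose(a, b) for a, b in zip(out_cached, out_full[-1])))
```

Both paths give the same output for the last token. The decode step only projects and scores one new token, while the second prefill redoes the work for all six.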
> Please don't be so entitled.
When someone comes up with a wrong number, I try to be nice: I run the numbers myself, figure out why they would end up with such a number, and point out the specific mistake, instead of dumping a page of my own calculations. It's usually just a missing factor somewhere. Guess I shouldn't be so nice to people who keep insisting that you can be fine without a k/v cache during decoding. Also, in this case I admit I failed to come up with a theory for why your number is so off, because giving out prefill numbers and claiming they are decode numbers isn't in my book.
Yeah, I know this sounds extremely mean, feel free to downvote, but I hope readers can feel my frustration now.