by bluegatty
149 days ago
When history is cached, conversations tend not to get slower, because the LLM can 'continue' from a previous state. So if there was already A + A1 + B + B1 + C + C1 and you ask 'D' ... well, [A->C1] is saved as state. It costs about 10ms to load. Then they add 'D' as your question, and that gets processed 'all tokens at once' in bulk, which is fast. Then, when they generate D1 (the response), they have to do it one token at a time, which is slow, because each new token depends on the ones before it. Also, even if they had to redo all of [A->C1] 'from scratch', it's not that slow, because the entire block of tokens can be processed in one pass. 'Prefill' (aka processing A->C1) is fast, which by the way is why it's 10x cheaper than generation. So prefill is 10x faster than generation, and cache is 10x cheaper than prefill, as a very general rule of thumb.
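To put rough numbers on that, here is a toy back-of-envelope model in Python. Everything in it is an assumption for illustration: the 20ms per generated token is made up, the 10x prefill speedup and the 10ms cache load are just the rules of thumb above, and `turn_latency_ms` is a hypothetical helper, not any provider's API.

```python
# Toy latency model for one chat turn. All constants are illustrative
# assumptions, not measurements from any real system.
DECODE_MS_PER_TOKEN = 20.0                        # assumed: generation, one token at a time
PREFILL_MS_PER_TOKEN = DECODE_MS_PER_TOKEN / 10   # rule of thumb: prefill ~10x faster
CACHE_LOAD_MS = 10.0                              # rule of thumb: loading saved state


def turn_latency_ms(history_tokens, question_tokens, answer_tokens, cached):
    """Estimate the latency of answering 'D' given history [A->C1]."""
    if cached:
        # History is already saved as state; just load it.
        history_ms = CACHE_LOAD_MS
    else:
        # Redo the whole history from scratch, but in bulk.
        history_ms = history_tokens * PREFILL_MS_PER_TOKEN
    question_ms = question_tokens * PREFILL_MS_PER_TOKEN  # 'D', processed in bulk
    answer_ms = answer_tokens * DECODE_MS_PER_TOKEN       # 'D1', token by token
    return history_ms + question_ms + answer_ms


if __name__ == "__main__":
    for cached in (True, False):
        total = turn_latency_ms(1000, 50, 300, cached=cached)
        print(f"cached={cached}: {total:.0f} ms")
```

With these made-up numbers, generating the 300-token answer dominates either way, which is the point: the history is the cheap part whether or not it is cached.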