| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bluegatty 196 days ago

When history is cached conversations tend not to be slower, because the LLM can 'continue' from a previous state.

So if there was already A + A1 + B + B1 + C + C1 and you asking 'D' ... well, [A->C1] is saved as state. It costs 10ms to prepare. Then, they add 'D' as your question and that will be done 'all tokens at once' in bulk - which is fast.

Then - they they generate D1 (the response) they have to do it one token at a time, which is slow. Each token has to be processed separately.

Also - even if they had to redo- all of [A->C1] 'from scratch' - its not that slow, because the entire block of tokens can be processed in one pass.

'prefill' (aka A->C1) is fast, which by the way is why it's 10x cheaper.

So prefill is 10x faster than generation, and cache is 10x cheaper than prefill as a very general rule of thumb.

1 comments

Fazebooking 196 days ago

Thats only the case with KV Cache and we do not know how and how long providers keep it.

link

bluegatty 196 days ago

Prefill is 10x faster than generation without caching, and 100x faster with caching - as a very crude measure. So it's not a matter of 'only the case'. Those are different scenarios. Some hosts are better than others with respect to managing caching, but the better one's provide decent SLA on that.

link