by mmoskal
636 days ago
Decode speed is generally memory-bandwidth bound. Prefill is typically arithmetic (compute) bound. This is the reason for mixed batches (both decode and prefill): they let you saturate both memory bandwidth and arithmetic. Chunked prefill is for minimizing latency for the decode entries in the same batch. It's not needed if you have only one request; in that case it's fastest to just prefill in one chunk. I'm pretty sure the sibling comment is right about the different length limits: it's because of training, and the model talking nonsense if you let it run too long.
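A rough back-of-the-envelope sketch of why decode ends up memory bound and prefill compute bound. This is a simplification, not a real profiler: it assumes about 2 FLOPs per parameter per token, that the weights are read once per forward pass, and it ignores KV cache and activation traffic. The hardware numbers (roughly 989 TFLOP/s dense BF16 and 3.35 TB/s for an H100 SXM) are my assumptions, and the function names are made up for illustration.

```python
def arithmetic_intensity(tokens_in_batch: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token, weights read once per pass.
    The parameter count cancels out, so only the token count matters here.
    """
    return 2.0 * tokens_in_batch / bytes_per_param


def likely_bound(tokens_in_batch: int,
                 peak_flops: float = 989e12,      # assumed H100 dense BF16 peak
                 mem_bandwidth: float = 3.35e12,  # assumed H100 HBM bandwidth, bytes/s
                 bytes_per_param: float = 2.0) -> str:
    """Compare the batch's intensity with the hardware ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(tokens_in_batch, bytes_per_param) >= ridge:
        return "compute"
    return "memory"


if __name__ == "__main__":
    print(likely_bound(1))     # a single decode step
    print(likely_bound(4096))  # a 4096-token prefill chunk
```

Under these assumptions a lone decode step does about 1 FLOP per byte of weights, far below the ridge point, while a multi-thousand-token prefill is far above it. Mixing the two in one batch is what lets you use both resources.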
For example, consider a prompt sent to Llama 3.1 405B that uses 128k input tokens.
The KV cache will be 123GB. No matter how many GPUs you shard the model across, you are not fitting that KV cache in GPU memory (an H100 has 80GB).
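For anyone who wants to check the 123GB figure, here is a minimal sketch of the usual KV cache size estimate: 2 (keys and values) x layers x KV heads x head dimension x bytes per element x tokens. The model shape below (126 layers, 16 KV heads, head dimension 128, 2-byte elements) is my assumption, not something stated in the comment; it happens to reproduce roughly 123 GiB for 128,000 tokens. With 8 KV heads, which I believe is the grouped-query setting listed for this model, the same formula gives about half that.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence, in bytes.

    The factor of 2 accounts for storing both keys and values per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


GIB = 1024 ** 3

if __name__ == "__main__":
    # Assumed shape for Llama 3.1 405B; treat these numbers as illustrative.
    for kv_heads in (16, 8):
        size = kv_cache_bytes(tokens=128_000, layers=126,
                              kv_heads=kv_heads, head_dim=128)
        print(f"{kv_heads} KV heads: {size / GIB:.1f} GiB")
```

Either way, this is a per-sequence total, so the per-GPU share depends on how the cache is split across devices.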