by kouteiheika
120 days ago
> The main explanation of the bottleneck is wrong: it's the model weights which dominate memory bandwidth (and hence why batching multiple requests in a single pass increases total throughput). If copying user tokens was the bottleneck, batching would not achieve any speedup.

Inference is memory-bound only at low batch sizes. At high batch sizes it becomes compute-bound. There's a certain threshold beyond which stuffing more requests into a batch will slow down every individual request, even though it may still increase the aggregate tokens/second across all requests in the batch.

Also, there does not exist any batch size > 1 where per-request throughput is equal to bs=1. Doing any batching at all will slow all intra-batch requests down.
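
To make the memory-bound vs. compute-bound point concrete, here is a rough toy roofline sketch of a single decode step. Every number in it is made up for illustration (model size, KV cache size, bandwidth, peak FLOPS), and the function names are mine, not from any real framework. It also assumes memory traffic and compute overlap perfectly, which real hardware does not quite manage.

```python
# Toy roofline model of one decode step. All constants are hypothetical.
WEIGHT_BYTES = 14e9       # e.g. a 7B-parameter model stored in fp16
KV_BYTES_PER_REQ = 20e6   # KV cache read per request per step (short context)
FLOPS_PER_TOKEN = 14e9    # roughly 2 FLOPs per parameter per generated token
MEM_BANDWIDTH = 1e12      # bytes/s
PEAK_FLOPS = 200e12       # FLOP/s


def memory_time(bs: int) -> float:
    """Seconds spent streaming weights once plus each request's KV cache."""
    return (WEIGHT_BYTES + bs * KV_BYTES_PER_REQ) / MEM_BANDWIDTH


def compute_time(bs: int) -> float:
    """Seconds of arithmetic for one token per request in the batch."""
    return bs * FLOPS_PER_TOKEN / PEAK_FLOPS


def step_time(bs: int) -> float:
    """Step latency under the (optimistic) perfect-overlap assumption."""
    return max(memory_time(bs), compute_time(bs))


def bound(bs: int) -> str:
    """Which resource limits the step at this batch size."""
    return "memory" if memory_time(bs) >= compute_time(bs) else "compute"


def per_request_tps(bs: int) -> float:
    """Tokens/second seen by a single request inside the batch."""
    return 1.0 / step_time(bs)


def aggregate_tps(bs: int) -> float:
    """Tokens/second summed over the whole batch."""
    return bs / step_time(bs)


if __name__ == "__main__":
    for bs in (1, 2, 8, 64, 256, 512, 1024):
        print(f"bs={bs:5d} {bound(bs):8s} "
              f"per-request={per_request_tps(bs):8.1f} tok/s "
              f"aggregate={aggregate_tps(bs):10.1f} tok/s")
```

In this sketch per-request speed drops from bs=2 onward only because each extra request adds its own KV cache traffic. If you model the weights alone with perfect overlap, per-request speed would look flat until the compute-bound crossover, so how large the slowdown is at small batch sizes depends heavily on these assumptions.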