| HN Mirror



	by dekhn 1043 days ago
	Actually, many inference systems instead batch all requests that arrive within a time window and submit them in a single shot. This increases average latency but handles more requests per unit time. (At least, this is my understanding of how production serving of expensive models that support batching works.)
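To make the time-window idea concrete, here is a minimal sketch of how requests might be grouped. The function name `batch_by_window`, the `window_s` and `max_batch` parameters, and the arrival times are all hypothetical, chosen for illustration; real serving stacks use their own schedulers and this is not any particular system's implementation.

```python
from typing import List, Tuple


def batch_by_window(
    arrivals: List[Tuple[float, str]],
    window_s: float,
    max_batch: int,
) -> List[List[str]]:
    """Group (arrival_time, request) pairs into batches.

    A batch opens when its first request arrives and closes once
    window_s seconds have passed or it already holds max_batch requests.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_start = 0.0
    for t, req in sorted(arrivals):
        # Close the open batch if the window expired or the batch is full.
        if current and (t - window_start >= window_s or len(current) >= max_batch):
            batches.append(current)
            current = []
        if not current:
            window_start = t  # this request opens a new window
        current.append(req)
    if current:
        batches.append(current)
    return batches


# Hypothetical arrivals (seconds) with a 10 ms window.
arrivals = [(0.000, "a"), (0.004, "b"), (0.009, "c"), (0.012, "d"), (0.030, "e")]
print(batch_by_window(arrivals, window_s=0.010, max_batch=8))
# prints [['a', 'b', 'c'], ['d'], ['e']]
```

Request "a" waits for the rest of its window before anything runs, which is where the extra average latency comes from.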

1 comments

jacquesm 1043 days ago

I've done a bunch of optimization for GPU code (in CUDA) and there are typically a few bottlenecks that really matter:

- memory bandwidth

- interconnect bandwidth between the CPU and GPU

- interconnect bandwidth between GPUs

- thermals and power if you're doing a good job of optimizing the rest

I don't see how a batching mechanism would improve on any of those; superficially it looks as though it would make matters worse rather than better. Can you explain where the advantage comes from?


dekhn 1043 days ago

It's a latency vs. throughput tradeoff. I was surprised as well. But most GPUs can do 32 inferences in the same time as they can do 1 inference. They have all the parallel units required and there are significant setup costs that can be amortized since all the inferences share the same model, weights, etc.
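A back-of-the-envelope sketch of that tradeoff, taking the claim above at face value (a batch of up to 32 runs in the same time as a single inference). The function name, the 50 ms step time, the 20 ms collection window, and the assumption that requests arrive uniformly within the window (so the average wait is half the window) are all illustrative, not measurements.

```python
from typing import Tuple


def throughput_and_latency(
    batch_size: int, step_time_s: float, window_s: float
) -> Tuple[float, float]:
    """Return (requests per second, average latency in seconds).

    Assumes the GPU step takes step_time_s regardless of batch_size,
    and that a request waits on average half the collection window.
    """
    throughput = batch_size / step_time_s
    avg_latency = window_s / 2 + step_time_s
    return throughput, avg_latency


# Hypothetical numbers: 50 ms per GPU step, 20 ms batching window.
unbatched = throughput_and_latency(batch_size=1, step_time_s=0.050, window_s=0.0)
batched = throughput_and_latency(batch_size=32, step_time_s=0.050, window_s=0.020)
print("unbatched (req/s, latency s):", unbatched)
print("batched   (req/s, latency s):", batched)
```

Under these assumptions throughput grows by the batch size, while each request pays a modest extra wait for the window to fill.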

https://groq.com/wp-content/uploads/2020/05/GROQP002_V2.2.pd...
the "batching" section of https://docs.nvidia.com/deeplearning/tensorrt/archives/tenso...
https://le.qun.ch/en/blog/2023/05/13/transformer-batching/


jacquesm 1043 days ago

Very interesting, thank you. I will point one of my colleagues who is busy with this stuff to these, and I thank you on his behalf as well; it is exactly the kind of thing they are engaged in.


ColonelPhantom 1043 days ago

I think in the case of LLM inference the main bottleneck is streaming the weights from VRAM to CU/SM/EU (whatever naming your GPU vendor of choice uses).

If you're doing inference on multiple prompts at the same time by batching, you don't spend any more time streaming. But each streamed weight gets used for, say, 32 calculations instead of 1, making better use of the GPU's compute resources.
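The weight-streaming argument can be sketched as a simple roofline-style estimate: one decoding step takes whichever is longer, streaming all weights once or doing the arithmetic for the whole batch. Every number below is a hypothetical round figure (a 7e9-parameter model at 2 bytes per parameter, 1 TB/s of memory bandwidth, 100 TFLOP/s of peak compute, and roughly 2 FLOPs per parameter per token), and the model ignores activations, KV cache traffic, and kernel overheads.

```python
def step_time_s(
    batch: int,
    n_params: float,
    bytes_per_param: float,
    mem_bw_bytes_s: float,
    peak_flops: float,
) -> float:
    """Estimate the time for one decoding step over a batch.

    Weights are streamed from VRAM once per step regardless of batch size;
    compute grows linearly with the batch (about 2 FLOPs per parameter per token).
    """
    stream_time = n_params * bytes_per_param / mem_bw_bytes_s
    compute_time = batch * 2 * n_params / peak_flops
    return max(stream_time, compute_time)


# Hypothetical hardware and model figures, for illustration only.
cfg = dict(n_params=7e9, bytes_per_param=2, mem_bw_bytes_s=1e12, peak_flops=1e14)
for batch in (1, 32, 200):
    print(batch, step_time_s(batch, **cfg))
```

With these figures, batch sizes 1 and 32 take the same step time because streaming dominates; only once the batch is large enough for compute to exceed the streaming time does the step start to slow down.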
