|
|
|
|
|
by dekhn
1043 days ago
|
|
Actually many inference systems instead batch all requests within a time period and submit them as a single shot. It increases the average latency but handles more requests per unit time. (at least, this is my understanding how production serving of expensive models that support batching work) |
|
- memory bandwidth
- interconnect bandwidth between the CPU and GPU
- interconnect bandwidth between GPUs
- thermals and power if you're doing a good job of optimizing the rest
I don't see how a batching mechanism would improve on any of those, superficially it looks as though that would make matters worse rather than better. Can you explain where the advantage comes from?