It's a latency vs. throughput tradeoff. I was surprised as well, but most GPUs can run 32 inferences in roughly the same time as a single one. They have all the parallel units required, and there are significant setup costs that can be amortized because all the inferences in a batch share the same model, weights, etc.
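To make the amortization concrete, here is a toy cost model in Python. It is only a sketch: the constants `SETUP_MS`, `WAVE_MS`, and `PARALLEL_SLOTS` are made-up illustrative values, not measurements from any real GPU, and real hardware will not behave this cleanly.

```python
import math

# Hypothetical numbers chosen only for illustration; not measurements.
SETUP_MS = 5.0        # assumed fixed cost per call (launch overhead, loading weights)
WAVE_MS = 1.0         # assumed time for one "wave" of parallel work
PARALLEL_SLOTS = 32   # assumed number of inferences processed side by side


def batch_latency_ms(batch_size: int) -> float:
    """Time to finish one batch: fixed setup plus one wave per 32 inputs."""
    waves = math.ceil(batch_size / PARALLEL_SLOTS)
    return SETUP_MS + waves * WAVE_MS


def throughput_per_s(batch_size: int) -> float:
    """Inferences completed per second when running batches back to back."""
    return batch_size / batch_latency_ms(batch_size) * 1000.0


if __name__ == "__main__":
    for b in (1, 8, 32, 64):
        print(f"batch={b:3d}  latency={batch_latency_ms(b):.1f} ms  "
              f"throughput={throughput_per_s(b):.0f}/s")
```

In this model a batch of 1 and a batch of 32 take the same wall-clock time, so throughput scales with batch size until the parallel slots are full. The latency cost shows up elsewhere: a request may have to wait for the batch to fill before it runs at all.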
Very interesting, thank you. I will point one of my colleagues who is busy with this stuff to these, and I thank you on his behalf as well; it is exactly the kind of thing they are engaged in.