Hacker News new | ask | show | jobs
by dekhn 1044 days ago
It's a latency vs. throughput tradeoff. I was surprised as well. But most GPUs can do 32 inferences in the same time as they can do 1 inference. They have all the parallel units required and there are significant setup costs that can be amortized since all the inferences share the same model, weights, etc.

https://groq.com/wp-content/uploads/2020/05/GROQP002_V2.2.pd... the "batching" section of https://docs.nvidia.com/deeplearning/tensorrt/archives/tenso... https://le.qun.ch/en/blog/2023/05/13/transformer-batching/

1 comments

Very interesting, thank you. I will point one of my colleagues that is busy with this stuff to these and I thank you on his behalf as well, it is exactly the kind of thing they are engaged in.