| HN Mirror



	by txyx303 658 days ago
	Batched inference increases your overall throughput, but each user still sees roughly the original per-request throughput. It's not necessarily a memory vs. compute issue in the same way training is. As far as I understand, it's more a function of the auto-regressive nature of transformer inference: each token depends on the ones before it, so a single request can't be generated any faster just by adding more requests to the batch. If you have an H100 doing 100 tokens/sec and you batch 1000 requests, you might be able to get to 100K tok/sec overall, but each user's request will still be outputting 100 tokens/sec, so the response stream is no faster. So if your output stream is slow, batching might not improve the user experience, even if it gets you higher chip utilization and higher "overall" throughput.
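	To make the arithmetic concrete, here is a rough sketch in Python. The function names and the `efficiency` parameter are made up for illustration, and the default of 1.0 is the idealized case from the numbers above (no per-stream slowdown as the batch grows), which real hardware probably won't hit.

```python
def batched_throughput(per_stream_tok_s: float, batch_size: int,
                       efficiency: float = 1.0) -> tuple[float, float]:
    """Return (aggregate tok/s, per-user tok/s) for concurrent requests.

    `efficiency` is a hypothetical fudge factor: 1.0 assumes batching does
    not slow down any individual stream, which is the best case.
    """
    per_user = per_stream_tok_s * efficiency  # what one user sees
    aggregate = per_user * batch_size         # what the chip produces overall
    return aggregate, per_user


def seconds_to_stream(num_tokens: int, per_user_tok_s: float) -> float:
    """Time for one user to receive a response of `num_tokens` tokens."""
    return num_tokens / per_user_tok_s


# Numbers from the comment: 100 tok/s per stream, 1000 batched requests.
aggregate, per_user = batched_throughput(100.0, 1000)
print(aggregate, per_user)               # 100000.0 100.0
print(seconds_to_stream(500, per_user))  # 5.0
```

	A 500-token reply takes about 5 seconds to stream whether the batch holds 1 request or 1000. Only the aggregate number moves.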