| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ryao 581 days ago
	Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.

1 comments

YetAnotherNick 581 days ago

8 H100 could achieve lot more than 1000 token/sec.

> Memory bandwidth for inferencing does not scale with the number of GPU

It does

link

ryao 581 days ago

This is on llama 3.1 405B.

Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.

link

YetAnotherNick 580 days ago

This two sources[1][2] shows 1500-2500 token/per second on 8*H100.

[1]: https://lmsys.org/blog/2024-07-25-sglang-llama3/?ref=blog.ru...

[2]: https://www.snowflake.com/engineering-blog/optimize-llms-wit...

link