| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by YetAnotherNick 581 days ago

8 H100 could achieve lot more than 1000 token/sec.

> Memory bandwidth for inferencing does not scale with the number of GPU

It does

1 comments

ryao 581 days ago

This is on llama 3.1 405B.

Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.

link

YetAnotherNick 580 days ago

This two sources[1][2] shows 1500-2500 token/per second on 8*H100.

[1]: https://lmsys.org/blog/2024-07-25-sglang-llama3/?ref=blog.ru...

[2]: https://www.snowflake.com/engineering-blog/optimize-llms-wit...

link