| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by YetAnotherNick 582 days ago
	You get ~400TFLOP/s in H100 for 350W. You need (2 * token/s * param count) FLOP/s. For 405b, 969tok/s you just need 784 TFLOP/s which is just 2 H100s. The limiting factor with GPU for inference is memory bandwidth. For 969 tok/s in int8, you need 392 TB/s memory bandwidth or 200 H100s.

3 comments

latchkey 582 days ago

Memory bandwidth and memory size. Along with power/cooling density.

Hence why you see AMD's MI325x coming out with 256GB HBM3e, but it is the same FLOPs as a 300x. 6TB/s too, which outperforms H200's, by a lot.

You can see the direction AMD is going with this...

https://www.amd.com/en/products/accelerators/instinct/mi300/...

link

Const-me 582 days ago

> For 969 tok/s in int8, you need 392 TB/s memory bandwidth

I think that math is only valid for batch size = 1. When these 969 tokens/second come from multiple sessions of the same batch, loaded model tensor elements are reused to compute many tokens for the entire batch. With large enough batches, you can even saturate compute throughput of the GPU instead of bottlenecking on memory bandwidth.

link

ryao 581 days ago

They claim to obtain that number with 8 to 20 concurrent users:

https://x.com/draecomino/status/1858998347090325846

link

ryao 581 days ago

Memory bandwidth for inferencing does not scale with the number of GPUs. Scaling instead requires more concurrent users. Also, I am told that 8 H100 cards can achieve 600 to 1000 tokens per second with concurrent users.

link

YetAnotherNick 581 days ago

8 H100 could achieve lot more than 1000 token/sec.

> Memory bandwidth for inferencing does not scale with the number of GPU

It does

link

ryao 581 days ago

This is on llama 3.1 405B.

Inferencing is memory bandwidth bound. Add more GPUs on a batch size 1 inference problem and watch it run no faster than the memory bandwidth of a single GPU. It does not scale across the number of GPUs. If it could, you would see clusters of Nvidia hardware outperforming Cerebras’ hardware. That is currently a fantasy.

link

YetAnotherNick 580 days ago

This two sources[1][2] shows 1500-2500 token/per second on 8*H100.

[1]: https://lmsys.org/blog/2024-07-25-sglang-llama3/?ref=blog.ru...

[2]: https://www.snowflake.com/engineering-blog/optimize-llms-wit...

link