Llama 405B 506 tokens/second on an H200


	Llama 405B 506 tokens/second on an H200 (developer.nvidia.com)
	21 points by moondistance 613 days ago

3 comments

Not "an H200". The article says: "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"

Title on HN is wrong. The article says GPUs and it's referring to one of their 8xH200 boxes.
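To show why the distinction matters, here is a minimal sketch that normalizes a system-level throughput figure to a per-GPU number. It assumes the 506 tokens/second in the title is the aggregate for the whole eight-GPU box, which is this comment's reading and not something confirmed in the thread. `per_gpu_throughput` is a hypothetical helper name.

```python
def per_gpu_throughput(system_tokens_per_s: float, num_gpus: int) -> float:
    """Average tokens/second per GPU, assuming the input is a system-wide aggregate."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return system_tokens_per_s / num_gpus


# Assumption: 506 tokens/s is the total for an 8xH200 system, not a single GPU.
print(per_gpu_throughput(506, 8))  # 63.25
```

Under that assumption, the headline number works out to roughly 63 tokens/second per GPU, which is a very different claim from 506 on a single H200.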

And this is why nobody submits MLPerf against NVIDIA.

It's weird. I looked up whether AMD has any benchmarks on the 405B for the MI300X and came across this one: https://dstack.ai/blog/amd-mi300x-inference-benchmark/#token...

From my understanding, it can get up to around 2500 tokens/s? Both are 8x units (H200 and MI300X).

Significant further optimizations. FP8!