Hacker News
Llama 405B 506 tokens/second on an H200 (developer.nvidia.com)
21 points by moondistance 613 days ago | 3 comments
EgoIncarnate 613 days ago
Not "an H200". The article says: "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"
FanaHOVA 613 days ago
Title on HN is wrong. The article says GPUs and it's referring to one of their 8xH200 boxes.
7e 613 days ago
And this is why nobody submits MLPerf against NVIDIA.
greenknight 613 days ago
It's weird, I looked up whether AMD has any benchmarks on the 405B for the MI300X, and came across this one:
https://dstack.ai/blog/amd-mi300x-inference-benchmark/#token...
From my understanding, it can get up to around 2500 tokens/s? Both are 8x units (H200 and MI300X).
moondistance 613 days ago
Significant further optimizations. FP8!