
by rwitten 994 days ago

Hey! I'm a contributor on this (Rafi Witten), all opinions my own.

You're asking the right question, but I think the math is off by a bit. The equivalent number on the H100 is 989 TFLOP/s/chip, so the equivalent job is ~10K H100s = (10 * 10^18) / (989 * 10^12). (Both chips also have 8-bit acceleration!)
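To make the division above easy to check, here is a minimal sketch. The two constants are simply the figures quoted in this comment (10 exaFLOP/s for the job, 989 TFLOP/s per H100); the function name `equivalent_chips` is made up for illustration.

```python
# Figures quoted in the comment above; not independently verified here.
JOB_FLOPS_PER_SEC = 10e18    # 10 exaFLOP/s for the whole job
H100_FLOPS_PER_SEC = 989e12  # 989 TFLOP/s per chip


def equivalent_chips(job_flops: float, per_chip_flops: float) -> float:
    """Number of chips whose combined peak throughput matches the job."""
    return job_flops / per_chip_flops


chips = equivalent_chips(JOB_FLOPS_PER_SEC, H100_FLOPS_PER_SEC)
print(f"{chips:,.0f} H100s")  # a little over 10K
```

This compares peak throughput only; it says nothing about achieved utilization on either chip.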

I believe this is the largest ML job ever demonstrated, both by exaflops and by number of chips. Other companies own more chips or exaflops than we show in this job, but getting all the hardware working at once on a single job is a different matter! :-)

2 comments

sashank_1509 994 days ago

I think your math is also slightly off. The Google article claims a system “that is capable of achieving 10 exa-FLOPs (16-bit)”, so you should be comparing with 16-bit operations on an H100.

989 is the TF32 Tensor Core figure; for 16-bit it is 1979, so I guess around 5,000 H100s in a single training job would be equivalent to the training job mentioned in this article.

Either way, I actually would not be surprised if OpenAI has launched a single job on more than 10K GPUs, but I am also not very knowledgeable about practical scaling. Congrats on the feat!


aschleck 994 days ago

1979 TFLOP/s for 16-bit on an H100 is with sparsity. See footnote 2 on https://www.nvidia.com/en-us/data-center/h100/. You should be halving it for non-sparse flops.


YetAnotherNick 994 days ago

GP is correct. With sparsity it is 3958 TFLOP/s; 1979 TFLOP/s is without sparsity.


emu 994 days ago

No, it is not. 3958 is the sparse fp8 flop number, but you need to ignore sparsity and compare bf16 flops, not fp8 flops, for the comparison the ancestor post is making.
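For anyone following the back-and-forth, a rough sketch of how much the answer moves depending on which per-chip figure you pick. The labels are my reading of how each number is described in this subthread (an assumption, check the NVIDIA datasheet footnotes yourself), and `quoted_tflops` is just an illustrative name.

```python
JOB_TFLOPS = 10e6  # 10 exaFLOP/s (16-bit) expressed in TFLOP/s

# Per-chip peak figures as quoted in this thread, keyed by TFLOP/s.
# Labels follow the reading given in this subthread, not a verified spec.
quoted_tflops = {
    989: "bf16, no sparsity",
    1979: "bf16, with sparsity",
    3958: "fp8, with sparsity",
}

# Number of H100s needed to match the job's peak under each figure.
h100_equivalents = {t: JOB_TFLOPS / t for t in quoted_tflops}

for tflops, label in quoted_tflops.items():
    print(f"{tflops:>5} TFLOP/s ({label}): {h100_equivalents[tflops]:,.0f} H100s")
```

Only the first row is a dense 16-bit to dense 16-bit comparison, which is why the ~10K estimate upthread holds.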


latchkey 994 days ago

I'd love to hear more about the challenges of getting the hardware working.
