Hacker News
by dekhn 1106 days ago
Can the creators explain in more detail: how is this different from (for example) the OpenAI cluster that MSFT built in Azure? Is it hosted in an existing cloud provider, or in a data center? Which data center? Who admins the system, and is there an SRE team in case it goes down during training? And can you attempt to run the same benchmarks that Top500 uses to determine what your double precision flops are, and give that number in addition to your "10 exaflops" (which I believe is single precision)?

Pretty sure it's FP8, not singles. (Which for the H100 makes a 60x difference.)
As an ex-supercomputer nerd (where the fastest system in the world finally reached over 1 exaflops of double precision), it seems awfully weird to call FP8 "flops". There's nothing truly wrong with it (since "flops" is a fairly poorly defined term), but it makes it clear that ML supercomputers are very different beasts from classic supercomputers. It also makes me wonder if/when the classic folks will try to make more codes work correctly with smaller precision (for example, in molecular dynamics).
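For anyone who wants to sanity-check the precision arithmetic in this thread, here is a rough sketch. The per-GPU peak figures are assumptions, recalled from NVIDIA's public H100 SXM datasheet and not taken from the announcement, and the helper names `ratio` and `implied_peak` are made up for illustration. It only rescales theoretical peaks; the Top500 number is a measured Linpack result, so it would come in lower still.

```python
# Peak per-GPU throughput in TFLOPS.
# ASSUMPTION: values recalled from NVIDIA's public H100 SXM datasheet;
# check the current datasheet before relying on them.
H100_TFLOPS = {
    "fp8_tensor_sparse": 3958.0,  # FP8 tensor cores, with structured sparsity
    "fp8_tensor_dense": 1979.0,   # FP8 tensor cores, dense
    "fp32": 67.0,                 # single precision, non-tensor
    "fp64": 34.0,                 # double precision, non-tensor
}


def ratio(fast, slow, table=H100_TFLOPS):
    """How many times higher the peak of `fast` is than the peak of `slow`."""
    return table[fast] / table[slow]


def implied_peak(claimed_exaflops, claimed_format, target_format, table=H100_TFLOPS):
    """Rescale a cluster-wide peak claim to another number format.

    Assumes the claim is simply (GPU count) x (per-GPU peak), so the GPU
    count cancels out and only the per-GPU ratio matters.
    """
    return claimed_exaflops * table[target_format] / table[claimed_format]


# FP8 (sparse) versus single precision on one GPU.
print(f"FP8 sparse / FP32: {ratio('fp8_tensor_sparse', 'fp32'):.1f}x")

# What a "10 exaflops" FP8 claim would correspond to as FP64 peak.
fp64_ef = implied_peak(10.0, "fp8_tensor_sparse", "fp64")
print(f"Implied FP64 peak: {fp64_ef * 1000:.0f} petaflops")
```

Under those assumed figures, the FP8-to-FP32 ratio lands close to the 60x mentioned above, and a 10 exaflops FP8 claim shrinks to well under 1 exaflops of double precision peak. If the claim was made with dense rather than sparse FP8 numbers, swap in `fp8_tensor_dense` and the implied FP64 figure roughly doubles.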