How much did this pretraining run cost? I'm impressed that efforts like this are now practical.
Let me try a guess for the cost; please fact-check it if you can.
They indicate using 10^22 FLOPs.
A $5/h[0] EC2 H100 (1671 bfloat16 teraFLOPS[0]) will sustain about 830 TFLOPS at 50% MFU. That works out to roughly 3,350 GPU-hours, so the pretraining run costs (10^22/830e12)/3600*5 = $17K.
It would actually be about twice that, roughly $33K, since NVIDIA always lists the "with sparsity" FLOPS as the headline number and dense throughput is half of it. But I bet they got a bunch of research credits to do this.
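
For anyone who wants to check the arithmetic, here is a minimal sketch of the estimate in Python. The function name and its parameters are my own and purely illustrative; the inputs (10^22 FLOPs, 1671 TFLOPS headline, 50% MFU, $5 per GPU-hour) are the figures from this comment, and the dense number simply assumes half the "with sparsity" headline.

```python
def pretraining_cost_usd(total_flops, peak_tflops, mfu, usd_per_gpu_hour):
    """Back-of-envelope training cost: total FLOPs / sustained FLOP/s, priced per GPU-hour."""
    sustained_flops_per_s = peak_tflops * 1e12 * mfu  # throughput actually achieved
    gpu_hours = total_flops / sustained_flops_per_s / 3600
    return gpu_hours * usd_per_gpu_hour


TOTAL_FLOPS = 1e22          # figure quoted in the comment
HEADLINE_TFLOPS = 1671      # bfloat16 headline number ("with sparsity")
MFU = 0.5                   # assumed model FLOPs utilization
USD_PER_GPU_HOUR = 5.0      # assumed price per H100-hour

# Estimate using the headline number as-is.
cost_headline = pretraining_cost_usd(TOTAL_FLOPS, HEADLINE_TFLOPS, MFU, USD_PER_GPU_HOUR)

# Estimate assuming dense throughput is half the headline number.
cost_dense = pretraining_cost_usd(TOTAL_FLOPS, HEADLINE_TFLOPS / 2, MFU, USD_PER_GPU_HOUR)

print(f"headline peak: ${cost_headline:,.0f}")
print(f"dense peak:    ${cost_dense:,.0f}")
```

Using 1671 instead of the rounded 830 TFLOPS shifts the first figure slightly, but it still rounds to about $17K, and the dense case lands at about $33K. Both scale linearly with price and inversely with MFU, so a different GPU-hour rate or utilization is easy to plug in.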