FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900 XTX machine (less than $4k worth of GPU), using the same settings as in the post (0.5M batch size etc).
About to kick off a from-scratch training run on the same fineweb-10B, which at 324k toks/sec should take about 8.6 hours. At my electricity rate, that works out to about $2.50 to train.
Will report back tomorrow when the training has finished.
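For anyone who wants to check the time estimate, here is a rough sketch of the arithmetic. It assumes fineweb-10B is a nominal 10 billion tokens and that throughput holds steady for the whole run. The function name is just illustrative.

```python
def hours_to_train(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock hours to process total_tokens at a steady throughput."""
    return total_tokens / tokens_per_sec / 3600.0

# Assumption: fineweb-10B is treated as a nominal 10 billion tokens.
TOTAL_TOKENS = 10e9

est_hours = hours_to_train(TOTAL_TOKENS, 324_000)
print(f"{est_hours:.1f} hours")  # prints "8.6 hours"
```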
So... successfully reproduced in ~8.75 hours, using about 18 kWh ($2.70).
The first run actually failed at step 3000 or so, and I realized I had a bug in my attention / matmul kernels. After fixing that and restarting, it worked great.
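The energy figures above imply an electricity rate and an average power draw. This is a quick sketch of that back-calculation, using only the rounded numbers quoted here, so treat the results as approximate. The draw is for the whole machine at the wall, not just the GPUs.

```python
# Rounded figures quoted in the comment above.
energy_kwh = 18.0   # total energy for the run
cost_usd = 2.70     # total electricity cost
run_hours = 8.75    # wall-clock duration

# Implied price per kWh and average whole-machine power draw.
rate_usd_per_kwh = cost_usd / energy_kwh
avg_power_kw = energy_kwh / run_hours

print(f"${rate_usd_per_kwh:.2f} per kWh")  # prints "$0.15 per kWh"
print(f"{avg_power_kw:.2f} kW average")    # prints "2.06 kW average"
```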
Final loss from that fineweb-10B run (since then I'm up to ~100k toks/sec/GPU):
step 18865/18865 | train loss 3.280550 | norm 0.4362 | lr 0.00e+00 | 1669.06 ms | 55.4% A100 fp16 MFU | 314058 tok/s
Writing state to log124M/state_00018865_00003.bin
val loss 3.296179
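As a sanity check, the final log line is consistent with the ~8.75 hour figure and the 0.5M batch size. The sketch below parses that one line and assumes the last step's time is representative of every step, which ignores validation and checkpointing pauses. The regexes are written for this pasted line only and are not a general parser for the trainer's logs.

```python
import re

# The final training log line, copied from the output above.
line = ("step 18865/18865 | train loss 3.280550 | norm 0.4362 | lr 0.00e+00 "
        "| 1669.06 ms | 55.4% A100 fp16 MFU | 314058 tok/s")

total_steps = int(re.search(r"step \d+/(\d+)", line).group(1))
ms_per_step = float(re.search(r"([\d.]+) ms", line).group(1))
tok_per_sec = int(re.search(r"(\d+) tok/s", line).group(1))

# Assumption: every step took about as long as the last one.
run_hours = total_steps * ms_per_step / 1000.0 / 3600.0

# Tokens per step implied by the reported throughput and step time.
tokens_per_step = tok_per_sec * ms_per_step / 1000.0

print(f"{run_hours:.2f} hours")  # prints "8.75 hours"
print(f"{tokens_per_step / 1e6:.2f}M tokens per step")
```

The implied tokens per step comes out at roughly 0.52M, which lines up with the 0.5M batch size setting.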
You can buy these GPUs on Amazon for under $1k. I heard the MI300X may be available in Azure now or at least very soon.