Hacker News
by anthonix1 746 days ago
FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900 XTX machine (less than $4k worth of GPU), using the same settings as in the post (0.5M batch size etc).

What percentage of the theoretical FLOPs are you getting on those 7900 XTXs during training?
55.4% in the last run, at running temperature
Did you reproduce the evaluation as well?
It converges similarly on smaller datasets.

About to kick off a from-scratch training run on the same fineweb-10B, which at 324k toks/sec should take about 8.6 hours. At my kWh cost, that comes to about $2.50 to train.

Will report back tomorrow when the training has finished.
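A quick sanity check of that time estimate, as a minimal sketch. It assumes fineweb-10B is exactly 10e9 tokens and that throughput stays constant for the whole run; the function name is made up for illustration.

```python
def training_hours(total_tokens: float, tokens_per_second: float) -> float:
    """Wall-clock hours needed to process total_tokens at a constant rate."""
    return total_tokens / tokens_per_second / 3600.0

# Assumption: "fineweb-10B" taken as exactly 10e9 tokens.
hours = training_hours(10e9, 324_000)
print(f"estimated run time: {hours:.2f} h")
```

That lands within rounding of the 8.6 hours quoted above.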

So... successfully reproduced in ~8.75 hours, taking about 18 kWh / $2.70
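For anyone wanting to plug in their own electricity rate, here is a rough back-of-the-envelope sketch of what those numbers imply. The only inputs are the figures in the comment (8.75 hours, 18 kWh, $2.70); the variable names are illustrative, and the average draw is for whatever the 18 kWh figure covers, which the comment does not specify.

```python
# Figures quoted in the comment above.
run_hours = 8.75
energy_kwh = 18.0
cost_usd = 2.70

# Implied electricity price and average power draw over the run.
price_per_kwh = cost_usd / energy_kwh   # dollars per kWh
avg_power_kw = energy_kwh / run_hours   # kilowatts, averaged over the run

print(f"implied price: ${price_per_kwh:.2f}/kWh")
print(f"average draw:  {avg_power_kw:.2f} kW")
```

So the implied rate is about $0.15/kWh at an average draw of roughly 2 kW.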

The first run actually failed at step 3000 or so; I realized I had a bug in my attention / matmul kernels, and after fixing that and restarting, it worked great.

[1] https://github.com/anthonix/llm.c

What was the final loss? Is this hardware available for rent somewhere?
Final loss from that fineweb-10B run (since then I'm up to ~100k toks/sec/GPU):

step 18865/18865 | train loss 3.280550 | norm 0.4362 | lr 0.00e+00 | 1669.06 ms | 55.4% A100 fp16 MFU | 314058 tok/s
Writing state to log124M/state_00018865_00003.bin
val loss 3.296179
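The numbers in that final step line can be cross-checked against the rest of the thread. This is only a sketch under two assumptions: that the "0.5M batch size" means 2**19 = 524288 tokens per step, and that every step took about as long as the last one (1669.06 ms). The logged tok/s may also be a running average, so an exact match is not expected.

```python
# Values read off the final log line.
steps = 18865
step_ms = 1669.06
logged_tok_per_s = 314058

# Assumption: "0.5M batch size" means 2**19 tokens per optimizer step.
assumed_batch_tokens = 2**19

# Tokens per step implied by the logged throughput and step time.
implied_batch_tokens = logged_tok_per_s * step_ms / 1000.0

# Totals, assuming every step matched the last one.
total_tokens = steps * assumed_batch_tokens
total_hours = steps * step_ms / 1000.0 / 3600.0

print(f"implied tokens/step: {implied_batch_tokens:,.0f}")
print(f"total tokens:        {total_tokens / 1e9:.2f}B")
print(f"total time:          {total_hours:.2f} h")
```

Under those assumptions the implied batch is within about 0.02% of 524288, the run covers roughly 9.89B tokens, and the total time comes out near the ~8.75 hours reported earlier.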

You can buy these GPUs on Amazon for under $1k. I heard the MI300X may be available in Azure now or at least very soon.