| HN Mirror



	by anthonix1 793 days ago
	FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900 XTX machine (less than $4k worth of GPU), using the same settings as in the post (0.5M batch size etc).

2 comments

Manabu-eo 792 days ago

What percentage of the theoretical FLOPs are you getting from those 7900 XTXs in training?


anthonix1 792 days ago

55.4% in the last run, at running temperature
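For readers who want to see how a utilization figure like this is usually derived, here is a minimal sketch. It uses the common estimate of about 6 FLOPs per parameter per token plus an attention term. The model shape (a GPT-2-small-like 124M-parameter configuration) and the per-GPU peak FP16 figure are assumptions for illustration, not values stated in this thread. Only the ~318,000 toks/sec across 4 GPUs comes from the thread.

```python
def flops_per_token(n_params, n_layers, n_heads, head_dim, seq_len):
    """Approximate training FLOPs per token (forward + backward).

    6 * n_params covers the dense matmuls; the second term adds the
    attention-score matmuls, which scale with sequence length.
    """
    return 6 * n_params + 12 * n_layers * n_heads * head_dim * seq_len


def flops_utilization(tokens_per_sec, flops_per_tok, peak_flops):
    """Fraction of peak hardware FLOPs actually achieved."""
    return tokens_per_sec * flops_per_tok / peak_flops


# ASSUMPTIONS (not from the thread): model shape and vendor peak FLOPs.
ASSUMED_PARAMS = 124e6        # hypothetical 124M-parameter model
ASSUMED_PEAK_FLOPS = 122.8e12 # hypothetical per-GPU FP16 peak; check the spec sheet

per_gpu_tokens_per_sec = 318_000 / 4  # thread figure, split over 4 GPUs
fpt = flops_per_token(ASSUMED_PARAMS, n_layers=12, n_heads=12, head_dim=64, seq_len=1024)
util = flops_utilization(per_gpu_tokens_per_sec, fpt, ASSUMED_PEAK_FLOPS)
print(f"estimated utilization: {util:.1%}")
```

The result depends heavily on which peak number you divide by, so treat it as a rough cross-check rather than a reproduction of the author's measurement.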


pama 793 days ago

Did you reproduce the evaluation as well?


anthonix1 793 days ago

It converges similarly on smaller datasets.

About to kick off a from-scratch training run on the same fineweb-10B, which at 324k toks/sec should take about 8.6 hours. With my kWh cost, that works out to about $2.50 to train.

Will report back tomorrow when the training has finished.
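The 8.6-hour estimate above is simple division, sketched here for anyone who wants to plug in their own throughput. It assumes "fineweb-10B" means roughly 10 billion training tokens processed once; that reading is an assumption, and the function name is just illustrative.

```python
def training_hours(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock hours to process total_tokens at a steady throughput."""
    return total_tokens / tokens_per_sec / 3600.0


# ASSUMPTION: ~10 billion tokens in one pass over fineweb-10B.
ASSUMED_TOTAL_TOKENS = 10e9
hours = training_hours(ASSUMED_TOTAL_TOKENS, tokens_per_sec=324_000)
print(f"estimated run time: {hours:.2f} hours")
```

This ignores evaluation pauses, checkpointing, and any throttling, so a real run will likely take a bit longer than the estimate.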


anthonix1 792 days ago

So... successfully reproduced in ~8.75 hours, taking about 18 kWh / $2.70

The first run actually failed at step 3000 or so, and I realized I had a bug in my attention / matmul kernels. After fixing that and restarting, it worked great.

[1] https://github.com/anthonix/llm.c
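As a quick sanity check on the reported 18 kWh / $2.70 over ~8.75 hours, the sketch below backs out the implied electricity price and average power draw. The function names are illustrative, and the power figure is for the whole machine as reported, not a per-GPU measurement.

```python
def implied_price_per_kwh(total_cost: float, energy_kwh: float) -> float:
    """Electricity price implied by a total cost and energy used."""
    return total_cost / energy_kwh


def average_power_kw(energy_kwh: float, hours: float) -> float:
    """Average power draw over the run, in kilowatts."""
    return energy_kwh / hours


# Figures taken from the comment above.
price = implied_price_per_kwh(total_cost=2.70, energy_kwh=18.0)
power = average_power_kw(energy_kwh=18.0, hours=8.75)
print(f"implied price: ${price:.2f}/kWh, average draw: {power:.2f} kW")
```

Both inputs are rounded ("about 18 kWh", "~8.75 hours"), so the derived values are approximate.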


pama 789 days ago

What was the final loss? Is this hardware available for rent somewhere?


anthonix1 783 days ago

Final loss from that fineweb-10B run (since then I'm up to ~100k toks/sec/GPU):

You can buy these GPUs on Amazon for under $1k. I heard the MI300X may be available in Azure now or at least very soon.
