Hacker News new | ask | show | jobs
by anthonix1 745 days ago
It converges similarly on smaller datasets.

About to kick off a training from scratch run on the same fineweb-10B, which at 324k toks/sec should take about 8.6 hours. And with my kWh cost, that is about $2.50 cost to train.

Will report back tomorrow when the training has finished..