Hacker News
by tootyskooty, 306 days ago
I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0]: at minimum Muon, better init, and a carefully tuned learning rate.
[0]: https://github.com/KellerJordan/modded-nanogpt