| HN Mirror

	by tootyskooty 352 days ago
	I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0]. At minimum: Muon, better init, and a carefully tuned learning rate.
	[0]: https://github.com/KellerJordan/modded-nanogpt