| > nanochat is also inspired by modded-nanoGPT Nice synergy here. The lineage is: Karpathy's nanoGPT -> Keller Jordan's modded-nanoGPT (a speedrun of training nanoGPT) -> nanochat. modded-nanoGPT [1] is a great project and well worth checking out; it's all about massively speeding up the training of a small GPT model. Notably, it uses the author's Muon optimizer [2] rather than AdamW for the linear layers. [1] https://github.com/KellerJordan/modded-nanogpt [2] https://kellerjordan.github.io/posts/muon/ |
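
For anyone wondering what Muon does differently from AdamW: as I understand the write-up in [2], it takes the momentum-averaged gradient of each 2D weight matrix and approximately orthogonalizes it with a few Newton-Schulz iterations before applying the update. Below is a rough sketch of just that orthogonalization step, in plain Python so it runs without dependencies. The function names are mine, and the coefficients and step count are the ones I recall from the post, so treat them as assumptions rather than the reference implementation.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize the matrix g (hypothetical helper).

    Assumption: coefficients and step count are as recalled from the
    Muon write-up; check the post before relying on them.
    """
    a, b, c = 3.4445, -4.7750, 2.0315

    # Scale by the Frobenius norm so every singular value is at most 1.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + eps) for v in row] for row in g]

    # Work on the wide orientation so x @ x^T is the smaller product.
    tall = len(x) > len(x[0])
    if tall:
        x = transpose(x)

    for _ in range(steps):
        s = matmul(x, transpose(x))   # S = X X^T
        s2 = matmul(s, s)             # S^2
        # Quintic update: X <- a*X + (b*S + c*S^2) X
        poly = [[b * p + c * q for p, q in zip(rs, rs2)] for rs, rs2 in zip(s, s2)]
        px = matmul(poly, x)
        x = [[a * u + v for u, v in zip(rx, rp)] for rx, rp in zip(x, px)]

    return transpose(x) if tall else x
```

The result is only roughly orthogonal (singular values are pushed near 1, not exactly to 1), which the write-up argues is good enough in practice. In the real optimizer this runs on GPU tensors, and the remaining parameters are still handled by AdamW.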