by sytelus 744 days ago
So, NanoGPT took 1.8 days on 8xA100 to train the 124M model on 30.7B tokens using flash attention. Scaled linearly, that translates to about 14 hr for 10B tokens. With llm.c it is ~1.5 hr, which is almost a 10X speedup!
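
For anyone who wants to check the arithmetic, here is a rough sketch. It assumes training time scales linearly with token count and that both runs use comparable hardware, neither of which is confirmed above. The variable and function names are just illustrative. Under that assumption, 10B tokens works out to about 14.1 hr and the speedup to about 9.4x.

```python
# Back-of-the-envelope check of the figures quoted in the comment.
# Assumption: wall-clock time scales linearly with the number of tokens,
# and both runs are on comparable hardware (not confirmed in the thread).

NANOGPT_DAYS = 1.8        # reported NanoGPT training time on 8xA100
NANOGPT_TOKENS_B = 30.7   # tokens seen in that run, in billions
TARGET_TOKENS_B = 10.0    # token budget to compare against, in billions
LLMC_HOURS = 1.5          # approximate llm.c time for the 10B-token run


def scaled_hours(days, tokens_b, target_tokens_b):
    """Linearly rescale a training time in days to hours for another token budget."""
    return days * 24.0 * target_tokens_b / tokens_b


nanogpt_hours = scaled_hours(NANOGPT_DAYS, NANOGPT_TOKENS_B, TARGET_TOKENS_B)
speedup = nanogpt_hours / LLMC_HOURS

print(f"NanoGPT scaled to 10B tokens: {nanogpt_hours:.1f} hr")
print(f"Speedup with llm.c: {speedup:.1f}x")
```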

Does this look ballpark correct? Is there any summary of where the majority of this improvement comes from?