	by sytelus 791 days ago
	So, NanoGPT took 1.8 days on 8xA100 to train a 124M model on 30.7B tokens using flash attention. Scaled linearly, that would be roughly 14 hr for 10B tokens. With llm.c it is ~1.5 hr, which is almost a 10x speedup! Does this look ballpark correct? Is there a summary of where the majority of this improvement comes from?
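
A quick sanity check of that arithmetic, as a rough sketch only. It assumes training time scales linearly with token count and that both runs used comparable hardware, neither of which the comment confirms. The constant and function names are made up for illustration; the numbers are the ones quoted above.

```python
# Back-of-the-envelope check of the figures quoted in the comment.
# Assumption: wall-clock time scales linearly with the number of tokens.
NANOGPT_DAYS = 1.8        # quoted NanoGPT training time on 8xA100
NANOGPT_TOKENS_B = 30.7   # quoted NanoGPT token count, in billions
TARGET_TOKENS_B = 10.0    # token count used for the llm.c figure, in billions
LLMC_HOURS = 1.5          # quoted (approximate) llm.c training time


def scaled_hours(days: float, tokens_b: float, target_tokens_b: float) -> float:
    """Scale a training time in days to hours for a different token budget."""
    return days * 24.0 * target_tokens_b / tokens_b


nanogpt_hours = scaled_hours(NANOGPT_DAYS, NANOGPT_TOKENS_B, TARGET_TOKENS_B)
speedup = nanogpt_hours / LLMC_HOURS

print(f"NanoGPT scaled to 10B tokens: {nanogpt_hours:.1f} hr")
print(f"Implied speedup with llm.c:   {speedup:.1f}x")
```

Under these assumptions the scaled NanoGPT time is about 14.1 hr and the ratio about 9.4x, so "almost 10x" holds as a ballpark figure.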