by sytelus
744 days ago
So, NanoGPT took 1.8 days on 8xA100 to train the 124M model on 30.7B tokens using flash attention. Scaled linearly, that translates to about 14 hr for 10B tokens. With llm.c it is ~1.5 hr, which is almost a 10X speedup! Does this look ballpark correct? Is there a summary of where the majority of this improvement comes from?
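A quick sanity check of that arithmetic, as a rough sketch. It assumes training time scales linearly with token count and that both runs used comparable hardware, neither of which is confirmed here. The constant and function names are just illustrative.

```python
# Back-of-envelope check. ASSUMPTION: wall-clock time scales linearly with tokens.
NANOGPT_DAYS = 1.8        # reported NanoGPT run, 8xA100, 124M model
NANOGPT_TOKENS_B = 30.7   # billions of tokens in that run
TARGET_TOKENS_B = 10.0    # billions of tokens in the llm.c run
LLMC_HOURS = 1.5          # approximate llm.c time quoted above


def scaled_hours(days: float, tokens_b: float, target_tokens_b: float) -> float:
    """Rescale a run of `days` over `tokens_b` tokens to `target_tokens_b` tokens."""
    return days * 24.0 * target_tokens_b / tokens_b


nanogpt_hours = scaled_hours(NANOGPT_DAYS, NANOGPT_TOKENS_B, TARGET_TOKENS_B)
speedup = nanogpt_hours / LLMC_HOURS

print(f"NanoGPT at 10B tokens: {nanogpt_hours:.1f} hr")
print(f"Speedup vs llm.c: {speedup:.1f}x")
```

Dividing by exactly 30.7B rather than rounding to 30B gives a slightly smaller figure than 14.4 hr, but the speedup still lands between 9X and 10X.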