by onnodigcomplex
1253 days ago
I'm not at my desktop, but I've used nanoGPT extensively on my RTX 4090. I trained a GPT-2 small with a smaller context window (125M -> 90M params) at a batch size of 1536 using gradient accumulation (3*512). This runs at just a smidge over 1 it/s.

Some notes:

- Gradient checkpointing and a batch size of 2048 in a single pass gives a ~10% performance improvement on a per-sample basis.
- torch.compile doesn't work for me yet (the lowest CUDA version I got my 4090 to run on was 11.8, but the highest CUDA version on which I got the model to compile was 11.7).
- I applied the optimizations from https://arxiv.org/abs/2212.14034
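For anyone unfamiliar with the gradient accumulation setup mentioned above (3*512 = 1536), here is a rough sketch of the idea. It is a toy in plain Python with a one-parameter linear model and a hand-written gradient, not nanoGPT's training loop; the names `grad_mse`, `MICRO_BATCH`, and `ACCUM_STEPS` are made up for illustration. It only shows that averaging the gradients of three micro-batches of 512 matches the gradient of one batch of 1536 when the loss is a mean over samples.

```python
import random


def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


random.seed(0)

MICRO_BATCH = 512                             # samples per forward/backward pass
ACCUM_STEPS = 3                               # micro-batches per optimizer step
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS   # 1536 samples per optimizer step

# Synthetic data (an assumption for the demo): y is roughly 3 * x plus noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(EFFECTIVE_BATCH)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5

# Accumulate: each micro-batch gradient is scaled by 1 / ACCUM_STEPS so the
# sum equals the mean-loss gradient over the whole effective batch.
accumulated = 0.0
for step in range(ACCUM_STEPS):
    lo = step * MICRO_BATCH
    hi = lo + MICRO_BATCH
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / ACCUM_STEPS

# Reference: the gradient computed over all 1536 samples in one go.
full = grad_mse(w, xs, ys)

print(f"accumulated gradient: {accumulated:.6f}")
print(f"full-batch gradient:  {full:.6f}")
```

The trade-off is memory versus passes: only one micro-batch of activations is held at a time, but each optimizer step needs three forward/backward passes.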
|
Do you train in fp16/bf16?
Have you tried fp8?