HN Mirror


by onnodigcomplex 1253 days ago

I'm not at my desktop, but I've used nanoGPT extensively on my RTX 4090. I trained a GPT-2 small with a reduced context window (125M -> 90M params) at a batch size of 1536 using gradient accumulation (3*512). This runs at just a smidge over 1 it/s. Some notes:

- Gradient checkpointing with a batch size of 2048 in a single pass (no gradient accumulation) gives a ~10% performance improvement on a per-sample basis.

- torch.compile doesn't work for me yet (the lowest CUDA version I got my 4090 to run on was 11.8, but the highest CUDA version on which I got the model to compile was 11.7).

- I applied the optimizations from https://arxiv.org/abs/2212.14034
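
For anyone unfamiliar with the 3*512 gradient accumulation mentioned above, here is a minimal sketch of the idea. It is not nanoGPT code: it uses a toy one-parameter linear model in plain Python, and the helper names (`grad_mse`, `accumulated_grad`) and the synthetic data are made up for illustration. It only shows that averaging the gradients of 3 micro-batches of 512 gives the same gradient as one batch of 1536.

```python
import random

def grad_mse(w, b, xs, ys):
    """Gradient of mean squared error for the toy model y = w*x + b."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def accumulated_grad(w, b, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches.

    Each micro-batch gradient is scaled by 1/steps so the sum equals
    the gradient of the mean loss over the full batch.
    """
    assert len(xs) % micro_batch_size == 0
    steps = len(xs) // micro_batch_size
    gw_acc, gb_acc = 0.0, 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        gw, gb = grad_mse(w, b, xs[lo:hi], ys[lo:hi])
        gw_acc += gw / steps
        gb_acc += gb / steps
    return gw_acc, gb_acc

# Hypothetical synthetic data: 3 micro-batches of 512 samples = 1536 total.
random.seed(0)
MICRO_BATCH, ACCUM_STEPS = 512, 3
xs = [random.uniform(-1, 1) for _ in range(MICRO_BATCH * ACCUM_STEPS)]
ys = [3.0 * x + 0.5 + random.gauss(0, 0.1) for x in xs]

full = grad_mse(0.0, 0.0, xs, ys)                        # one batch of 1536
accum = accumulated_grad(0.0, 0.0, xs, ys, MICRO_BATCH)  # 3 x 512

# The two gradients agree up to floating-point rounding.
print(abs(full[0] - accum[0]) < 1e-9, abs(full[1] - accum[1]) < 1e-9)
```

In a PyTorch training loop the same pattern usually means calling `backward()` on `loss / accum_steps` for each micro-batch and stepping the optimizer once after the last one. Only one micro-batch of activations is held in memory at a time, which is the reason to use it on a single 24 GB card.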

1 comment

lostmsu 1253 days ago

Can you share the code and numbers so that I could compare directly with my 3090?

Do you train in fp16/bf16?

Have you tried fp8?
