by sailingparrot 1976 days ago
> You don't need a TPU cluster to train a working GPT-2 model [...] A free GPU on Colab gets you most of the way

I have a hard time believing you can really train it on a single V100, unless you are talking about an extremely scaled-down version of GPT-2 (large).

If you can train it at all, it would be with a batch size so small (probably 1?) that it would hurt performance, and it would take months.
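A rough back-of-envelope sketch of the memory side of this, in Python. Everything here is an assumption on my part rather than anything from the paper: fp32 everywhere, an Adam-style optimizer with two extra states per parameter, a 16 GB V100, and activations ignored entirely (they only make things worse). `training_memory_gb` is a made-up helper name.

```python
def training_memory_gb(n_params, bytes_per_value=4, optimizer_states=2):
    """Rough lower bound on training memory in GB.

    Counts weights + gradients + optimizer states only.
    Assumes fp32 (4 bytes) and an Adam-style optimizer (2 states per parameter).
    Activations and framework overhead are ignored.
    """
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer states
    return n_params * bytes_per_value * copies / 1e9


V100_MEMORY_GB = 16       # assumption: the 16 GB variant
GPT2_1_5B_PARAMS = 1.5e9  # the 1.5B model
GPT2_SMALL_PARAMS = 124e6  # assumption: roughly 124M for the smallest GPT-2

needed_large = training_memory_gb(GPT2_1_5B_PARAMS)
needed_small = training_memory_gb(GPT2_SMALL_PARAMS)

print(f"1.5B model: ~{needed_large:.1f} GB, fits: {needed_large <= V100_MEMORY_GB}")
print(f"small model: ~{needed_small:.1f} GB, fits: {needed_small <= V100_MEMORY_GB}")
```

Under those assumptions the 1.5B model already needs more than the card holds before a single activation is stored, while the smallest model leaves plenty of room, which is why I suspect a smaller model or some memory-saving trick is involved.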

Am I out of the loop somehow?

Edit: I was thinking about reproducing the training that OpenAI did in their paper, so redoing all the pre-training, but I realized you might have been talking about training on a smaller custom dataset.

1 comment

Also, he might just be talking about training a much smaller model than the 1.5B one, because otherwise that would maybe take years.