If they have a cluster with 2,000 H800 GPUs (which is what they have stated in public) training would take 2,800,000 / (2,000 * 24 * 30) ~ 2 months.
A cluster of 2,000 GPUs is what a second tier AI lab has access to. And it shows that you can play in the state of the art LLM-game with some capital and a lot of brains.
Yesterday GPT asked me if I'd like to train a small LLM and I laughed out loud.
That being said I'm amazed how far 1B models have come. I remember when TinyLlama came out a few years ago, it was not great. ($40K training cost iirc.)
That was a 1B model, but these days even 0.5B models are remarkably coherent.
Can someone put this into perspective? I'm finding heterogenous data on other models, i.e. number of tokens, number of GPUs used, cost, etc. It's hard to compare it all.
A cluster of 2,000 GPUs is what a second tier AI lab has access to. And it shows that you can play in the state of the art LLM-game with some capital and a lot of brains.