Hacker News
by skummetmaelk 456 days ago
The fact that you can unironically put the "only" modifier on a training time of 2.8 million GPU hours is nuts.

If they have a cluster with 2,000 H800 GPUs (which is what they have stated in public) training would take 2,800,000 / (2,000 * 24 * 30) ~ 2 months.
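
For anyone who wants to check that estimate, here is a quick sketch of the arithmetic. It assumes 30-day months and every GPU busy around the clock, which is a simplification; the function name is only illustrative.

```python
def training_months(gpu_hours, num_gpus, hours_per_day=24, days_per_month=30):
    """Wall-clock months needed to spend `gpu_hours` on `num_gpus` running nonstop."""
    gpu_hours_per_month = num_gpus * hours_per_day * days_per_month
    return gpu_hours / gpu_hours_per_month


# Figures from the comment above: 2.8 million GPU hours on 2,000 GPUs.
months = training_months(2_800_000, 2_000)
print(f"{months:.2f} months")  # 1.94 months
```

Any real run would take longer than this lower bound once restarts, evaluation, and idle time are counted.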

A cluster of 2,000 GPUs is what a second-tier AI lab has access to. And it shows that you can play in the state-of-the-art LLM game with some capital and a lot of brains.

Isn't the price of an H800 like $30k?

I don't know what your household budget is, but $60M might not be what most people associate with "some capital".

It is a lot less than what Google, OpenAI, etc. have.

And the GPUs would be a shared resource, so what you should calculate is what it would have cost to rent them: probably something like $2M.
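
To put the buy-versus-rent numbers side by side, here is a rough sketch. It uses the $30k-per-GPU figure floated upthread and reads the rental guess as about $2M; the hourly rates in the loop are hypothetical placeholders, not quoted prices.

```python
GPU_HOURS = 2_800_000    # training budget discussed in the thread
NUM_GPUS = 2_000         # cluster size discussed in the thread
PRICE_PER_GPU = 30_000   # USD, the unconfirmed figure floated upthread


def purchase_cost(num_gpus, price_per_gpu):
    """Up-front cost of buying the whole cluster."""
    return num_gpus * price_per_gpu


def rental_cost(gpu_hours, usd_per_gpu_hour):
    """Cost of renting exactly the GPU hours used."""
    return gpu_hours * usd_per_gpu_hour


def implied_hourly_rate(total_usd, gpu_hours):
    """Hourly rate that a given total rental bill would imply."""
    return total_usd / gpu_hours


print(f"buy:  ${purchase_cost(NUM_GPUS, PRICE_PER_GPU):,}")  # buy:  $60,000,000
print(f"rate implied by $2M: ${implied_hourly_rate(2_000_000, GPU_HOURS):.2f}/GPU-hour")

# Hypothetical rental rates, for illustration only.
for rate in (1, 2):
    print(f"rent at ${rate}/GPU-hour: ${rental_cost(GPU_HOURS, rate):,}")
```

A $2M bill works out to roughly $0.71 per GPU-hour, so the rental estimate is quite sensitive to the rate you assume.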

Yesterday GPT asked me if I'd like to train a small LLM and I laughed out loud.

That being said, I'm amazed how far 1B models have come. I remember when TinyLlama came out a few years ago, it was not great. ($40K training cost, iirc.)

That was a 1B model, but these days even 0.5B models are remarkably coherent.

An H100 has 14592 CUDA cores. 2000 * 14592 gives you about 29 million cores, already far more than 2 million.

Can someone put this into perspective? I'm finding heterogeneous data on other models (number of tokens, number of GPUs used, cost, etc.). It's hard to compare it all.