by TechTechTech 1025 days ago
At the time of writing, the cost estimate for a 70B multimodal model trained on 7T tokens with 1000 H100 GPUs is $18,461,354, with 184 days of training time.

Is anyone willing to share an estimate of how the cost will come down each year as hardware keeps improving and new methodologies are possibly found?

Personally I would not be surprised if it is possible to train the same dataset for half the cost 12 months from now.
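
As a rough check on the numbers in the post, the quoted cost and run length imply a price per GPU-hour, and the "half the cost in 12 months" guess can be extended into a simple projection. This is only a sketch: the function names are made up for illustration, it assumes all 1000 GPUs are billed 24 hours a day for the full 184 days, and halving every year after the first is an assumption, not something the post claims.

```python
# Figures quoted in the post.
TOTAL_COST_USD = 18_461_354
NUM_GPUS = 1_000
TRAINING_DAYS = 184


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours, assuming every GPU runs 24 hours a day."""
    return num_gpus * days * 24


def implied_hourly_rate(total_cost: float, num_gpus: int, days: float) -> float:
    """Cost per GPU-hour implied by a total cost and a run length."""
    return total_cost / gpu_hours(num_gpus, days)


def projected_cost(cost_today: float, years: float, annual_factor: float = 0.5) -> float:
    """Cost after `years` if it is multiplied by `annual_factor` each year.

    annual_factor=0.5 is the post's 12-month guess; applying it to
    later years as well is purely an assumption.
    """
    return cost_today * annual_factor ** years


if __name__ == "__main__":
    hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
    rate = implied_hourly_rate(TOTAL_COST_USD, NUM_GPUS, TRAINING_DAYS)
    print(f"GPU-hours: {hours:,.0f}")
    print(f"Implied rate: ${rate:.2f} per GPU-hour")
    for year in range(4):
        print(f"Year {year}: ${projected_cost(TOTAL_COST_USD, year):,.0f}")
```

Under these assumptions the estimate works out to roughly 4.4 million GPU-hours at a little over $4 per GPU-hour.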

6 comments

It will not get cheaper until Nvidia is disrupted on the software side. There is already plenty of hardware that can do this cheaper, starting but not ending with Google’s TPU.
Correct.

It requires a software breakthrough: new, efficient methods for training and fine-tuning these AI models. Currently there is no way around training the whole thing and burning millions in the process.

Until then, unless you are a big tech company that can eat the cost, it doesn't seem wise to waste your entire VC money on expensive fine-tuning and inference costs as your AI model scales to millions.

I think this has a lot of potential https://www.modular.com/engine
At this point, I think there is sufficient motivation (dramatically high training costs) that we could see major algorithmic, architectural, and/or training methodology improvements at the code level that make these sorts of things possible on commodity hardware within a few years.

We're already starting to see that with a few projects, and I think once the scales tip such that it becomes practical to train something of GPT-4 quality for < $10k, the main focus of current research will shift to generating new models trained on commodity hardware.

My true hope is that the entire problem domain eventually ends up falling within the range of commodity hardware and FANG finds it can't really add any value (other than perhaps convenience) regardless of their superior compute resources, resulting in massive democratization of this technology.

That will of course open things up and make LLMs more accessible to bad actors, but this is ultimately a much better thing than the likes of FANG / OpenAI / etc being the sole gatekeepers of this tech. Just like Google has very little real motivation to fight click-fraud (there have been rumors for years that it is responsible for a double-digit percent of their revenue), these mega corporations will have very little real motivation to stop "bad actors" from paying to use their APIs, so the democratized situation is the less Orwellian one ultimately, since bad actors are going to use it either way.

You can train it at half the cost today if you use a LambdaLabs cluster at $1.89/H100/hr.

https://lambdalabs.com/service/gpu-cloud/reserved
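
To see where "half the cost" comes from, here is a quick back-of-the-envelope check in Python. It assumes the same 1000 H100s running 24 hours a day for the same 184 days as the original estimate, and it ignores storage, networking, and any minimum-commitment terms. The helper name `cost_at_rate` is made up for illustration.

```python
# Figures quoted in the thread.
BASELINE_COST_USD = 18_461_354  # original estimate from the post
NUM_GPUS = 1_000
TRAINING_DAYS = 184
RESERVED_RATE_USD = 1.89        # quoted price per H100 per hour


def cost_at_rate(rate_per_gpu_hour: float, num_gpus: int, days: float) -> float:
    """Total cost at a flat hourly rate, assuming GPUs are billed 24 hours a day."""
    return rate_per_gpu_hour * num_gpus * days * 24


reserved_cost = cost_at_rate(RESERVED_RATE_USD, NUM_GPUS, TRAINING_DAYS)
ratio = reserved_cost / BASELINE_COST_USD

if __name__ == "__main__":
    print(f"Cost at ${RESERVED_RATE_USD}/GPU-hr: ${reserved_cost:,.0f}")
    print(f"Fraction of the original estimate: {ratio:.0%}")
```

With these inputs the run comes to about $8.3 million, or roughly 45% of the original figure.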

Well if you select "trainium nodes", it's already "only" $11,085,287.
Are there big reasons the training can’t be done SETI@home style? You could even pay people for the use of their graphics cards, and run the training multiple times on different machines to make sure results weren’t being gamed.
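
The "run it multiple times" idea could look something like the sketch below: send the same work unit to several volunteers and only accept a result that a strict majority agrees on. Everything here is hypothetical (the function name, the `min_agree` parameter, the idea of comparing result hashes). It also assumes results can be compared exactly, which may not hold for floating-point work on different GPUs, and every extra replica multiplies the compute bill.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def accept_by_majority(
    results: Sequence[Hashable], min_agree: int = 2
) -> Optional[Hashable]:
    """Return the result reported by a strict majority of workers, else None.

    `results` holds one entry per volunteer for the same work unit,
    for example a hash of the values each one computed.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # Require both a minimum number of matching reports and a strict majority.
    if count >= min_agree and count * 2 > len(results):
        return value
    return None


if __name__ == "__main__":
    print(accept_by_majority(["abc123", "abc123", "ffff00"]))  # two of three agree
    print(accept_by_majority(["abc123", "ffff00", "0000aa"]))  # no agreement
```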
There is that, I think it's https://vast.ai/ and pretty sure there is also a "community" one I've seen for gen AI but I can't remember the name.
AI Horde
Yes that's what I was thinking of, thanks! https://aihorde.net
Is this inference, not training?
Yes it's inference only (and usually pretty slow at that).
Training still relies on very low-latency connections between all the devices. When distributing training across multiple machines, most people use machines in close vicinity connected via InfiniBand to get the lowest possible latency.

Going from that to the dozens to hundreds of milliseconds of latency on the internet, or the hours if you do classical SETI@Home, is a big step. There are people working on it though.

GPU memory bandwidth is a limiting factor for how fast training can happen, so it’s much more efficient to train models on locally connected high-memory GPUs.

Also gradient updates from all nodes would need to get combined at least every few training steps, and it would take a while to sync all gradient updates across the network.
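
To put a rough number on "a while", here is a bandwidth-only estimate for one full gradient sync of a 70B-parameter model. The assumptions are mine: plain data parallelism where every node holds the full gradient, 16-bit gradients, a ring all-reduce (each node sends about 2(N-1)/N times the gradient size), a 100 Mbit/s home uplink versus a 400 Gbit/s datacenter link, and no latency, compression, or overlap with compute. The function name is made up for illustration.

```python
def allreduce_seconds(
    num_params: float,
    bytes_per_param: float,
    num_nodes: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """Bandwidth-only time for one ring all-reduce of the full gradient.

    Each node sends about 2 * (N - 1) / N times the gradient size.
    Network latency is ignored.
    """
    gradient_bytes = num_params * bytes_per_param
    traffic_per_node = 2 * (num_nodes - 1) / num_nodes * gradient_bytes
    return traffic_per_node / bandwidth_bytes_per_s


PARAMS = 70e9              # 70B parameters, from the post
BYTES_PER_PARAM = 2        # assumption: 16-bit gradients
NODES = 1_000              # assumption: one node per GPU in the post
HOME_LINK = 100e6 / 8      # assumption: 100 Mbit/s, in bytes per second
DATACENTER_LINK = 400e9 / 8  # assumption: 400 Gbit/s, in bytes per second

home_seconds = allreduce_seconds(PARAMS, BYTES_PER_PARAM, NODES, HOME_LINK)
datacenter_seconds = allreduce_seconds(PARAMS, BYTES_PER_PARAM, NODES, DATACENTER_LINK)

if __name__ == "__main__":
    print(f"Home link: {home_seconds / 3600:.1f} hours per sync")
    print(f"Datacenter link: {datacenter_seconds:.1f} seconds per sync")
```

Under these assumptions a single sync takes hours over a home connection and seconds over a datacenter link, which is the gap the comment is pointing at.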

There's Petals[0], but the problem seems to be that the entire training data needs to be loaded into VRAM and can't be split up across devices.

[0] https://github.com/bigscience-workshop/petals

In 10 years you will be able to do it at home on a machine that costs less than $5k.