Hacker News new | ask | show | jobs
by arturventura 1262 days ago
This is really good, and I was really excited by it but then I read:

> running on a single 8XA100 40GB node in 38 hours of training

This is a $40-80k machine. Not a diss, but I would love to see an advance that would allow anyone with a high end computer to be able to improve on this model. Before that happens this whole field is going to be owned by big corporations.

15 comments

I don't know if that's a blocker. Ordinary people commonly rent a $40k machine for 38 hours from companies like Avis and Hertz.

If training a large model now costs the same as driving to visit grandma, that seems like a pretty good deal.

That's a great comparison. For a real number, I just checked Runpod and you can rent a system with 8xA100 for $17/hr or ~$700 for 38 hours. Not cheap, but also pretty close to the cost of renting a premium vehicle for a few days. I've trained a few small models by renting an 1xA5000 system and that only costs $0.44/hr, which is perfect for learning and experimentation.
It would be great if a tradeoff could be made, though. For example, train at 1/10th the speed for 1/10th of the cost.

This could correspond to taking public transport in your analogy, and would bring this within reach of most students.

Slower training tends to be only a little cheaper, because most modern architectures parallelize well, and they just care about the number of flops.

If you want to reduce cost, you need to reduce the model size, and you'll get worse results for less money.

The problem with that is currently, the available memory scales with the class of GPU.... and very large language models need 160-320GB of VRAM. So, there sadly isn't anything out there that you can load up a model this large on except a rack of 8x+ A40s/A100s.

I know there are memory channel bandwidth limits and whatnot but I really wish there was a card out there with a 3090 sized die but with 96GB of VRAM solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, thats fine. having only two of them to get 192GB and still fit on a desk and draw normal power would be great.

Technically this is not true- there are a lot of techniques to shard models and store activation between layers or even smaller subcomponents of the network. For example, you can split the 175B parameter bloom model into separate layers, load up a layer, read the prev. layers input from disk, and save the output to disk.

And NVIDIA does make cards like you are asking for - the A100 is the fast memory offering, the A40 the bulk slower memory (though they added the 80GB A100 and did not double the A40 to 96GB so this is less true now than the P40 vs P100 gen).

Oddly, you can get close to what you are asking for with a M1 Mac Studio - 128GB of decently fast memory with a GPU that is ~0.5x a 3090 in training.

Do you know if there's any work on peer-to-peer clustering of GPU resources over the internet? Imagine a few hundred people with 1-4 3080Tis each, running software that lets them form a cluster large enough to train and/or run a number of LLMs. Obviously the latency between shards would be orders of magnitude higher than a colocated cluster, but I wonder if that could be designed around?
Bloom-petals
I guess this would only become a reality if games started requiring these cards.
Well if it used to cost you $1 for 1hr at 1x speed, now it will take you 10hr at 0.1x speed, and if my math checks out $1. You need to shrink the model.
But of course now you run it on your own computer instead of in the DC, which changes the numbers. Especially if your student dorm has a shared electricity bill :)
The good news is that, unlike vehicles, the rate for rented compute will continue to drop
Let's not forget that rendering 3D Animations in 3DSMAX or Maya used to take days for a single frame for a complex scene, and months for a few minutes.
You have to gas it up and heaven help you if it gets a scratch or a scuff.
Great news! Cloud instances energy usage is included in their price, and because they're remote and transient it's impossible to permanently damage them.
I think the equivalent of being not careful and getting a dent in this context is to leave it open to the internet and having a bitcoin miner installed.
You free the instance and the miner is gone.
As you are paying for the resources you use that's fine.

The closest would be if you used some form of software bug to actually cause physical damage, certainly not impossible, but extremely unlikely compared with actually physically damaging a car.

A better fit would be, if you have unlimited liability like with AWS, and you leak your key pair. Then someone runs up a 100k bill setting up mining instances
but you still have to pay for network ingress/egress traffic.
Similarly maybe we should only let people rent a NanoGPT box if they are over 25 and they have to get collision insurance.
If you can fit the training into 24GB, a used RTX 3090 for $700-$800 seems like a good deal at the moment. They are about 45-65% as fast as the A100 according to https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVI...

So if you buy two of these cards it will take 12-13 days instead of 38 hours but only require a $2500 PC.

James Betker, who created tortoise TTS, built his own $15k machine with 8x RTX 3090 and trained the models with it. He now works for OpenAI…

Recommended reading:

https://timdettmers.com/2023/01/16/which-gpu-for-deep-learni...

TL;DR: You probably don't need that expensive Threadripper because 2x PCIe 4.0 x16 will not be very beneficial. Go cheap, go 2x PCIe 4.0 x8.

Any link to the 15k machine ?. Maybe it is cheaper now.
I think it was a DIY machine, those RTX 3090 have gotten cheaper for sure. From my experience, going beyond 4 GPUs is a pricey affair. See [§]. All but one model of the RTX3090 require at least 3 slots.

If 4 GPUs connected via PCIe 4.0x16 are enough you can choose among various sRTX4 boards for 3000 series AMD Threadripper CPUs.

[§] https://www.reddit.com/r/deeplearning/comments/tw0olq/commen...

Another useful URL: https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-...

It's a $33/hour machine on AWS, so about $1250 for one training run. Not cheap, but easily in the reach of startups and educational or research institutions.

Edit: or about $340 if you get the 8xA100 instance from lambdalabs, in the realm of normal hobby spending

Or $9/hour if you use Spot :-)

https://aws.amazon.com/ec2/spot/pricing/

Hopefully your progress gets saved in time when the spot instance inevitably gets terminated in the midst of training.
"Managed Spot Training..."

"...Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting...."

https://docs.aws.amazon.com/sagemaker/latest/dg/model-manage...

If you use Horovod Elastic, I think you can avoid this problem working across a cluster of Spot instances.

https://horovod.readthedocs.io/en/stable/elastic_include.htm...

If you're doing something new/ custom (which you presumably are if you aren't using someone else's prebuilt model), it could take a lot of runs to figure out the best training data and finetune settings.

(I assume. I've never worked with GPT, but have done similar work in other domains).

After training don't you have to keep it running if you want to use it?
Just download the model and run it on something much smaller and cheaper. Bigger models like GPT-J are a bit of a pain to run, but GPT2-sized models run just fine on consumer GPUs.
Ahh okay, thanks. So how big is the model? Seems like it should be available to download so people don't have to train it. I understand you can train it on custom data but for a "default" model are there any available to download?
What’s required to run the model?
The biggest GPT2 (1.5B params) takes about 10GB VRAM, meaning it runs on a RTX 2080 TI, or the 12GB version of the RTX 3080
What's the largest language model I can run on a 3090 with 24 GiB RAM?
https://github.com/karpathy/nanoGPT#i-only-have-a-macbook

> This creates a much smaller Transformer (4 layers, 4 heads, 64 embedding size), runs only on CPU, does not torch.compile the model (torch seems to give an error if you try), only evaluates for one iteration so you can see the training loop at work immediately, and also makes sure the context length is much smaller (e.g. 64 tokens), and the batch size is reduced to 8. On my MacBook Air (M1) this takes about 400ms per iteration. The network is still pretty expensive because the current vocabulary is hard-coded to be the GPT-2 BPE encodings of vocab_size=50257. So the embeddings table and the last layer are still massive. In the future I may modify the code to support simple character-level encoding, in which case this would fly. (The required changes would actually be pretty minimal, TODO)

But how often do you need to run this? You can run 8xA1000 on LambdaLabs [0] (no affiliation) for $8.80/hr. So you should be able to run the entire data set for less than $350.

[0] https://lambdalabs.com/service/gpu-cloud#pricing

They are acknowledged at the bottom for supporting andrej's research!!
A couple of weeks ago a new paper came out that shows how to train a high quality language model on a single GPU in one day.

https://arxiv.org/abs/2212.14034

If you can’t fit the model on your resources you can leverage DeepSpeed’s ZeRO-offload which will let you train GPT2 on a single V100 (32gb).

Alternatively, if you’re researching (with the caveat that you have to either publish, open source or share your results in a blog post) you can also get access to Google’s TPU research cloud which gives you a few v3-8s for 30 days (can’t do distributed training on devices but can run workloads in parallel). You can also ask nicely for a pod, I’ve been granted access to a v3-32 for 14 days pretty trivially which (if optimized) has more throughput than 8xA100 on transformer models.

TPUs and moreso pods are a bit harder to work with and TF performs far better than PyTorch on them.

https://www.deepspeed.ai/tutorials/zero-offload/

https://medium.com/analytics-vidhya/googles-tpu-research-clo...

I was curious about how much this would be to rent, because definitely the cost of those servers is outside the budget! Lambda has 8xA100 40gb for $8.80/hr: https://lambdalabs.com/service/gpu-cloud#pricing
It seems as likely as people being able to build big automaker level of cars just with tools in their garage. More compute is going to keep producing better results at least for LLMs.
How are universities and colleges dealing with this kind of demand for computing power? It must be hard to be able to do some courses now.
Most decently large colleges have been investing in HPC for a while, and started investing in GPU HPC around 2014. You'd be surprised what sort of school projects the compute budget exists for.
I went to a smallish state university, even there we had our own HPC center and lab. We had a proper HPC (IIRC) 6 row data center across campus and we had a continuous budget available to me as an undergraduate research assistant for building beowulf clusters for the graduate programs to run assignments on. I once got an allowance to buy 15 raspberry pis to build an arm cluster.
As far as research groups go - they get funds (project grants, donations, etc.) to purchase machines and parts, and then users have to timeshare them.

These machines are pretty much crunching numbers 24/7, and your project will get appended to a queue.

'group project'
That's to train it from scratch, though, right? If you preload the GPT2 weights you don't need to do this. You can just give it additional training on your texts.
Well, he does include instructions for running it on a personal computer, which looks like what I'm gonna be doing next week.

Besides the rental options discussed below these nvidia boxen don't look too big so either used ones will be available for cheap relatively soon, or you could just locate and liberate one in Promethean fashion.

If GPT-2 / nanoGPT needs this setup, just imagine what GPT3 / chatGPT needs!
Supposedly even running the trained model for ChatGPT is extremely expensive unlike the image generators which can largely be run on a consumer device.
I don’t know anything about this, but is that this instance type on AWS? p4d.24xlarge
You can rent on AWS and other cloud providers.
So if I see it right that would be a p4d.24xlarge instance. Which goes for about $32.77 an hour nowadays so the total training would be about $1245. Not cheap, but certainly not a nation state budget.

Edit: i just noticed lambda lab. It seems they ask $8.8 per hour for an instance of this caliber. That puts the total training cost around $334. I wonder how come it is that much cheaper.

That is a key difference. You can’t easily and cheaply rent an auto factory, but you’re starting to be able to rent an LLM training factory once for a model where you can then more cheaply run inference on.