by arturventura 1262 days ago

This is really good, and I was really excited by it but then I read:

> running on a single 8XA100 40GB node in 38 hours of training

This is a $40-80k machine. Not a diss, but I would love to see an advance that would allow anyone with a high-end computer to improve on this model. Until that happens, this whole field is going to be owned by big corporations.

15 comments

pavlov 1262 days ago

I don't know if that's a blocker. Ordinary people commonly rent a $40k machine for 38 hours from companies like Avis and Hertz.

If training a large model now costs the same as driving to visit grandma, that seems like a pretty good deal.

jetrink 1262 days ago

That's a great comparison. For a real number, I just checked Runpod and you can rent a system with 8xA100 for $17/hr or ~$700 for 38 hours. Not cheap, but also pretty close to the cost of renting a premium vehicle for a few days. I've trained a few small models by renting a 1xA5000 system and that only costs $0.44/hr, which is perfect for learning and experimentation.

amelius 1262 days ago

It would be great if a tradeoff could be made, though. For example, train at 1/10th the speed for 1/10th of the cost.

This could correspond to taking public transport in your analogy, and would bring this within reach of most students.

londons_explore 1262 days ago

Slower training tends to be only a little cheaper, because most modern architectures parallelize well, and they just care about the number of flops.

If you want to reduce cost, you need to reduce the model size, and you'll get worse results for less money.

mk_stjames 1262 days ago

The problem with that is that, currently, the available memory scales with the class of GPU, and very large language models need 160-320GB of VRAM. So, sadly, there isn't anything out there that you can load a model this large on except a rack of 8x+ A40s/A100s.

I know there are memory channel bandwidth limits and whatnot, but I really wish there was a card out there with a 3090-sized die but with 96GB of VRAM, solely to make it easier to experiment with larger models. If it takes 8 days to train vs. 1, that's fine. Having only two of them to get 192GB and still fit on a desk and draw normal power would be great.

buildbot 1262 days ago

Technically this is not true: there are a lot of techniques to shard models and store activations between layers or even smaller subcomponents of the network. For example, you can split the 176B-parameter BLOOM model into separate layers, load up one layer, read its input (the previous layer's output) from disk, and save its output to disk.
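
The layer-at-a-time idea described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how any particular library does it: the "layers" are tiny 2x2 matrices stored as JSON files, and the helper names (`offloaded_forward`, `save_json`, `load_json`) are hypothetical.

```python
import json
import os
import tempfile


def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, x) for x in vector]


def save_json(path, obj):
    with open(path, "w") as f:
        json.dump(obj, f)


def load_json(path):
    with open(path) as f:
        return json.load(f)


def offloaded_forward(layer_paths, input_vector, workdir):
    """Run a forward pass holding only one layer's weights in memory at a time."""
    act_path = os.path.join(workdir, "activations.json")
    save_json(act_path, input_vector)        # the initial input also lives on disk
    for path in layer_paths:
        weights = load_json(path)            # load just this layer
        activations = load_json(act_path)    # read the previous layer's output
        save_json(act_path, relu(matvec(weights, activations)))
        del weights                          # drop the layer before loading the next
    return load_json(act_path)


with tempfile.TemporaryDirectory() as workdir:
    # Toy 2x2 "layers" standing in for a sharded checkpoint (assumption).
    toy_layers = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[2.0, 0.0], [0.0, -1.0]],
    ]
    layer_paths = []
    for i, layer_weights in enumerate(toy_layers):
        path = os.path.join(workdir, f"layer_{i}.json")
        save_json(path, layer_weights)
        layer_paths.append(path)
    result = offloaded_forward(layer_paths, [3.0, 4.0], workdir)
    print(result)
```

The trade-off is the one the comment implies: peak memory drops to roughly one layer plus one set of activations, at the price of disk traffic on every layer.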

And NVIDIA does make cards like you are asking for - the A100 is the fast memory offering, the A40 the bulk slower memory (though they added the 80GB A100 and did not double the A40 to 96GB so this is less true now than the P40 vs P100 gen).

Oddly, you can get close to what you are asking for with an M1 Mac Studio - 128GB of decently fast memory with a GPU that is ~0.5x a 3090 in training.

sbrother 1262 days ago

Do you know if there's any work on peer-to-peer clustering of GPU resources over the internet? Imagine a few hundred people with 1-4 3080Tis each, running software that lets them form a cluster large enough to train and/or run a number of LLMs. Obviously the latency between shards would be orders of magnitude higher than a colocated cluster, but I wonder if that could be designed around?

pizza 1262 days ago

Bloom-petals

amelius 1262 days ago

I guess this would only become a reality if games started requiring these cards.

mcbuilder 1262 days ago

Well, if it used to cost you $1 for 1hr at 1x speed, it will now take you 10hr at 0.1x speed and, if my math checks out, still cost $1. You need to shrink the model.
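
To spell out the arithmetic in this comment, here is a minimal sketch. It assumes the hourly price is proportional to speed, which is the comment's premise rather than a quoted price list; the function name and numbers are illustrative only.

```python
def training_cost(base_rate_per_hour, base_hours, speed_factor):
    """Cost of a fixed training job on hardware running at speed_factor x baseline.

    Assumes the hourly price scales linearly with speed, so a machine
    that is 10x slower is also 10x cheaper per hour.
    """
    rate = base_rate_per_hour * speed_factor   # slower hardware, cheaper per hour
    hours = base_hours / speed_factor          # ...but the job runs longer
    return rate * hours


full_speed = training_cost(1.0, 1.0, 1.0)
tenth_speed = training_cost(1.0, 1.0, 0.1)
print(f"1x speed: ${full_speed:.2f}, 0.1x speed: ${tenth_speed:.2f}")
```

Under that assumption the two factors cancel, so the total depends only on the amount of compute the job needs.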

amelius 1262 days ago

But of course now you run it on your own computer instead of in the DC, which changes the numbers. Especially if your student dorm has a shared electricity bill :)

willseth 1262 days ago

The good news is that, unlike vehicles, the rate for rented compute will continue to drop

Apofis 1262 days ago

Let's not forget that rendering 3D animations in 3ds Max or Maya used to take days for a single frame of a complex scene, and months for a few minutes.

swader999 1262 days ago

You have to gas it up and heaven help you if it gets a scratch or a scuff.

speed_spread 1262 days ago

Great news! Cloud instances energy usage is included in their price, and because they're remote and transient it's impossible to permanently damage them.

aequitas 1262 days ago

I think the equivalent of being careless and getting a dent in this context is to leave it open to the internet and have a Bitcoin miner installed.

Aissen 1262 days ago

You free the instance and the miner is gone.

iso1631 1262 days ago

As you are paying for the resources you use that's fine.

The closest would be if you used some form of software bug to actually cause physical damage, certainly not impossible, but extremely unlikely compared with actually physically damaging a car.

idonotknowwhy 1262 days ago

A better fit would be if you have unlimited liability, like with AWS, and you leak your key pair. Then someone runs up a $100k bill setting up mining instances.

DesiLurker 1262 days ago

But you still have to pay for network ingress/egress traffic.

ofcourseyoudo 1262 days ago

Similarly maybe we should only let people rent a NanoGPT box if they are over 25 and they have to get collision insurance.

Tepix 1262 days ago

If you can fit the training into 24GB, a used RTX 3090 for $700-$800 seems like a good deal at the moment. They are about 45-65% as fast as the A100 according to https://bizon-tech.com/gpu-benchmarks/NVIDIA-RTX-3090-vs-NVI...

So if you buy two of these cards it will take 12-13 days instead of 38 hours but only require a $2500 PC.
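
A back-of-the-envelope check of that estimate, as a sketch only. It assumes perfectly linear scaling across cards and ignores the 24GB vs 40GB memory difference and interconnect overhead, so treat the result as a rough bound. The constant and function names are made up for illustration.

```python
A100_COUNT = 8
A100_HOURS = 38          # wall-clock time quoted for the 8xA100 node
RTX3090_COUNT = 2


def estimated_days(relative_speed):
    """Days to finish the same run on 2x RTX 3090, given per-card speed relative to one A100."""
    a100_gpu_hours = A100_COUNT * A100_HOURS                 # total work in A100-hours
    hours = a100_gpu_hours / (RTX3090_COUNT * relative_speed)
    return hours / 24


for speed in (0.45, 0.50, 0.65):
    print(f"{speed:.0%} of an A100 per card: {estimated_days(speed):.1f} days")
```

Across the quoted 45-65% range this gives roughly 10 to 14 days, and the 12-13 day figure corresponds to a card running at about half the speed of an A100.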

James Betker, who created tortoise TTS, built his own $15k machine with 8x RTX 3090 and trained the models with it. He now works for OpenAI…

Tepix 1248 days ago

Recommended reading:

https://timdettmers.com/2023/01/16/which-gpu-for-deep-learni...

TL;DR: You probably don't need that expensive Threadripper because 2x PCIe 4.0 x16 will not be very beneficial. Go cheap, go 2x PCIe 4.0 x8.

klaudioz 1261 days ago

Any link to the $15k machine? Maybe it is cheaper now.

Tepix 1261 days ago

I think it was a DIY machine; those RTX 3090s have gotten cheaper for sure. From my experience, going beyond 4 GPUs is a pricey affair. See [§]. All but one model of the RTX 3090 require at least 3 slots.

If 4 GPUs connected via PCIe 4.0 x16 are enough, you can choose among various sTRX4 boards for 3000 series AMD Threadripper CPUs.

[§] https://www.reddit.com/r/deeplearning/comments/tw0olq/commen...

Another useful URL: https://www.pugetsystems.com/labs/articles/Quad-GeForce-RTX-...

wongarsu 1262 days ago

It's a $33/hour machine on AWS, so about $1250 for one training run. Not cheap, but easily in the reach of startups and educational or research institutions.

Edit: or about $340 if you get the 8xA100 instance from lambdalabs, in the realm of normal hobby spending

belter 1262 days ago

Or $9/hour if you use Spot :-)

https://aws.amazon.com/ec2/spot/pricing/

snerbles 1262 days ago

Hopefully your progress gets saved in time when the spot instance inevitably gets terminated in the midst of training.

belter 1262 days ago

"Managed Spot Training..."

"...Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting...."

https://docs.aws.amazon.com/sagemaker/latest/dg/model-manage...
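
The checkpoint-and-resume pattern the quoted docs describe can be sketched generically. This toy loop does not use SageMaker at all: the checkpoint is a local JSON file, the "training step" is a stand-in, and every name here (`train`, `save_checkpoint`, `interrupt_at`) is hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint atomically so an interruption never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every, interrupt_at=None):
    """Toy loop that resumes from the last checkpoint; returns (start_step, end_step)."""
    state = load_checkpoint(path)
    start_step = state["step"]
    for step in range(start_step, total_steps):
        if step == interrupt_at:
            return start_step, step              # simulate the spot instance being reclaimed
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}   # stand-in for a real update
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return start_step, state["step"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    first = train(ckpt, total_steps=100, checkpoint_every=10, interrupt_at=37)
    second = train(ckpt, total_steps=100, checkpoint_every=10)
    print(first, second)
```

In this run the interruption at step 37 loses only the work done since the checkpoint at step 30, which is the point of checkpointing frequently on interruptible instances.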

acetabulum 1262 days ago

If you use Horovod Elastic, I think you can avoid this problem working across a cluster of Spot instances.

https://horovod.readthedocs.io/en/stable/elastic_include.htm...

bobbyi 1262 days ago

If you're doing something new/custom (which you presumably are if you aren't using someone else's prebuilt model), it could take a lot of runs to figure out the best training data and fine-tuning settings.

(I assume. I've never worked with GPT, but have done similar work in other domains).

weird-eye-issue 1262 days ago

After training don't you have to keep it running if you want to use it?

wongarsu 1262 days ago

Just download the model and run it on something much smaller and cheaper. Bigger models like GPT-J are a bit of a pain to run, but GPT2-sized models run just fine on consumer GPUs.

weird-eye-issue 1262 days ago

Ahh okay, thanks. So how big is the model? Seems like it should be available to download so people don't have to train it. I understand you can train it on custom data but for a "default" model are there any available to download?

bilsbie 1262 days ago

What’s required to run the model?

wongarsu 1262 days ago

The biggest GPT2 (1.5B params) takes about 10GB VRAM, meaning it runs on an RTX 2080 Ti, or the 12GB version of the RTX 3080
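
For a rough sense of where a number like that comes from, here is a weights-only estimate. It is a lower bound under stated assumptions: it counts nothing but the parameters themselves, so the gap to the ~10GB figure would presumably be activations and framework overhead, which this sketch does not model.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 2**30


GPT2_XL_PARAMS = 1.5e9   # "1.5B params" from the comment above

fp32_gib = weight_memory_gib(GPT2_XL_PARAMS, 4)   # 32-bit floats
fp16_gib = weight_memory_gib(GPT2_XL_PARAMS, 2)   # 16-bit floats
print(f"fp32 weights: {fp32_gib:.1f} GiB, fp16 weights: {fp16_gib:.1f} GiB")
```

The same function gives a first-pass answer to "what fits on my card": divide the available memory by the bytes per parameter and leave generous headroom.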

renewiltord 1262 days ago

What's the largest language model I can run on a 3090 with 24 GiB RAM?

JustSomeNobody 1262 days ago

https://github.com/karpathy/nanoGPT#i-only-have-a-macbook

> This creates a much smaller Transformer (4 layers, 4 heads, 64 embedding size), runs only on CPU, does not torch.compile the model (torch seems to give an error if you try), only evaluates for one iteration so you can see the training loop at work immediately, and also makes sure the context length is much smaller (e.g. 64 tokens), and the batch size is reduced to 8. On my MacBook Air (M1) this takes about 400ms per iteration. The network is still pretty expensive because the current vocabulary is hard-coded to be the GPT-2 BPE encodings of vocab_size=50257. So the embeddings table and the last layer are still massive. In the future I may modify the code to support simple character-level encoding, in which case this would fly. (The required changes would actually be pretty minimal, TODO)
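
The quoted remark that the embeddings table is "still massive" can be checked with a quick count. This is an approximation, not nanoGPT's own accounting: it uses the common rule of thumb of about 12 * n_embd^2 weights per Transformer block and ignores biases, LayerNorm, and position embeddings.

```python
VOCAB_SIZE = 50257   # GPT-2 BPE vocabulary, from the quoted README
N_EMBD = 64
N_LAYER = 4

embedding_params = VOCAB_SIZE * N_EMBD
# Rough rule of thumb: ~12 * n_embd^2 weights per Transformer block
# (4*d^2 for attention, 8*d^2 for the MLP), ignoring biases and LayerNorm.
block_params = N_LAYER * 12 * N_EMBD**2

print(f"token embedding table: {embedding_params:,} parameters")
print(f"all Transformer blocks: {block_params:,} parameters")
print(f"ratio: {embedding_params / block_params:.1f}x")
```

With these settings the token embedding table alone is more than 16 times larger than all four blocks combined, which is why a character-level vocabulary would make the small configuration so much cheaper.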

windexh8er 1262 days ago

But how often do you need to run this? You can run 8xA100 on LambdaLabs [0] (no affiliation) for $8.80/hr. So you should be able to run the entire data set for less than $350.

[0] https://lambdalabs.com/service/gpu-cloud#pricing

throwawaymaths 1262 days ago

They are acknowledged at the bottom for supporting Andrej's research!!

jph00 1262 days ago

A couple of weeks ago a new paper came out that shows how to train a high quality language model on a single GPU in one day.

https://arxiv.org/abs/2212.14034

haldujai 1262 days ago

If you can’t fit the model on your resources you can leverage DeepSpeed’s ZeRO-Offload which will let you train GPT2 on a single V100 (32GB).
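
As an illustration of what enabling this looks like, here is a sketch of a DeepSpeed configuration with ZeRO stage 2 and the optimizer state offloaded to CPU memory. The batch size and fp16 settings are assumptions chosen for the example, and key names have changed between DeepSpeed releases, so check them against the linked tutorial for the version you install.

```python
import json

# Illustrative DeepSpeed config: ZeRO stage 2 with optimizer state kept in
# CPU memory. Batch size and fp16 values are assumptions, not recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)   # save this as a JSON file and pass it to your DeepSpeed launch
```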

Alternatively, if you’re researching (with the caveat that you have to either publish, open source or share your results in a blog post) you can also get access to Google’s TPU research cloud which gives you a few v3-8s for 30 days (can’t do distributed training on devices but can run workloads in parallel). You can also ask nicely for a pod, I’ve been granted access to a v3-32 for 14 days pretty trivially which (if optimized) has more throughput than 8xA100 on transformer models.

TPUs, and more so pods, are a bit harder to work with, and TF performs far better than PyTorch on them.

https://www.deepspeed.ai/tutorials/zero-offload/

https://medium.com/analytics-vidhya/googles-tpu-research-clo...

dceddia 1262 days ago

I was curious about how much this would be to rent, because the cost of those servers is definitely outside the budget! Lambda has 8xA100 40GB for $8.80/hr: https://lambdalabs.com/service/gpu-cloud#pricing

Tenoke 1262 days ago

It seems about as likely as people being able to build automaker-level cars with just the tools in their garage. More compute is going to keep producing better results, at least for LLMs.

kzrdude 1262 days ago

How are universities and colleges dealing with this kind of demand for computing power? It must be hard to be able to do some courses now.

CuriouslyC 1262 days ago

Most decently large colleges have been investing in HPC for a while, and started investing in GPU HPC around 2014. You'd be surprised what sort of school projects the compute budget exists for.

r3trohack3r 1262 days ago

I went to a smallish state university, and even there we had our own HPC center and lab. We had a proper HPC data center (IIRC, 6 rows) across campus, and as an undergraduate research assistant I had a continuous budget available for building Beowulf clusters for the graduate programs to run assignments on. I once got an allowance to buy 15 Raspberry Pis to build an ARM cluster.

TrackerFF 1262 days ago

As far as research groups go - they get funds (project grants, donations, etc.) to purchase machines and parts, and then users have to timeshare them.

These machines are pretty much crunching numbers 24/7, and your project will get appended to a queue.

londons_explore 1262 days ago

'group project'

ProjectArcturis 1262 days ago

That's to train it from scratch, though, right? If you preload the GPT2 weights you don't need to do this. You can just give it additional training on your texts.

anigbrowl 1262 days ago

Well, he does include instructions for running it on a personal computer, which looks like what I'm gonna be doing next week.

Besides the rental options discussed below these nvidia boxen don't look too big so either used ones will be available for cheap relatively soon, or you could just locate and liberate one in Promethean fashion.

anilshanbhag 1262 days ago

If GPT-2 / nanoGPT needs this setup, just imagine what GPT-3 / ChatGPT needs!

Gigachad 1262 days ago

Supposedly even running the trained model for ChatGPT is extremely expensive, unlike the image generators, which can largely be run on a consumer device.

aidos 1262 days ago

I don’t know anything about this, but is that this instance type on AWS? p4d.24xlarge

base698 1262 days ago

You can rent on AWS and other cloud providers.

krisoft 1262 days ago

So if I see it right that would be a p4d.24xlarge instance. Which goes for about $32.77 an hour nowadays so the total training would be about $1245. Not cheap, but certainly not a nation state budget.

Edit: I just noticed Lambda Labs. It seems they ask $8.80 per hour for an instance of this caliber. That puts the total training cost around $334. I wonder how it is that much cheaper.
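
For comparison, the hourly rates quoted around this thread can be multiplied out in one place. The rates are the commenters' early-2023 figures, not current prices, and the calculation assumes exactly 38 billed hours with no interruptions, storage, or egress charges. The Spot figure is approximate and the instance type it refers to was not stated.

```python
TRAINING_HOURS = 38   # quoted wall-clock time on one 8xA100 40GB node

# Hourly rates as quoted by commenters in this thread (USD, early 2023).
hourly_rates = {
    "AWS p4d.24xlarge (on-demand)": 32.77,
    "Runpod 8xA100": 17.00,
    "AWS Spot (approx.)": 9.00,
    "Lambda 8xA100": 8.80,
}

costs = {name: rate * TRAINING_HOURS for name, rate in hourly_rates.items()}
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:<30} ${cost:>8,.2f}")
```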

liquidk 1262 days ago

That is a key difference. You can’t easily and cheaply rent an auto factory, but you’re starting to be able to rent an LLM training factory once to produce a model that you can then run inference on more cheaply.
