| HN Mirror

The bottleneck for most model training sizes is VRAM, and since each 4090 has 24 GB VRAM, that's 96 GB VRAM total. The article mentions that it can train LLMs from scratch up to 1 billion hyperparameters, which tracks.

Nowadays that's not a lot: a single H100 that you can now rent has 80 GB VRAM, and doesn't have the technical overhead of handling work across GPUs.

tmostak 579 days ago

You should be able to train/full-fine-tune (i.e. full weight updates, not LoRA) a much larger model with 96GB of VRAM. I generally have been able to do a full fine-tune (which is equivalent to training a model from scratch) of 34B parameter models at full bf16 using 8XA100 servers (640GB of VRAM) if I enable gradient checkpointing, meaning a 96GB VRAM box should be able to handle models of up to 5B parameters. Of course if you use LoRA, you should be able to go much larger than this, depending on your rank.

Definitely agree but part of the reason why i built this to learn about all the overhead and gotchas

llm_nerd 579 days ago

Is there a reason you used hyperparameters rather than parameters? I was going to politely correct the terminology but you seem to be in AI for some time so either it was a mistype or I am misunderstanding what you are referencing.

didgeoridoo 579 days ago

I imagine that when you get really deep into model training, it can seem like there are a billion hyperparameters you have to worry about.

It's a force of habit, parameters would be more accurate (almost everyone uses them interchangeably nowadays)

unixpickle 579 days ago

Wait what? Who actually calls trainable params "hyperparameters"? Nobody at OpenAI does, as far as I know.

People who are making quick social media posts while taking a casual walk outside on websites that don't make it easy to edit posts and are not expecting to be nitpicked about it.

Overall, it's something I've seen very often on social media and less technical articles about LLMs. OpenAI would fall into the "almost" category.

llm_nerd 579 days ago

It's okay to say that you mistyped or whatever, while taking a casual walk outside on websites that don't make it easy to edit posts and are not expected to be nitpicked about it. Throwing in that everyone uses them interchangeably, however, is just profoundly wrong on every level.

I wasn't nitpicking. It is a HUGE differentiation, and I pointed it out specifically because people pick up on terminology so people who might not know better will go forward and just drop in the more super duper hyperparameter, not realizing that it makes them look like they don't know what they're talking about. As I said in the other post, no one who knows anything uses them interchangeably. It is just completely wrong.

Bancakes 579 days ago

I doubt the RAM is added up. I think that’s only a feature reserved for their NVLinked HPC series cards. In fact, without nvlink, I don’t see how you’d connect them together to compute a single task in a performant and efficient way.

It depends on how the parallelism is implemented, e.g. distributed data parallel (DDP) to synchronize gradients: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

It's a rabbit hole I stay away from for pragmatic reasons.

whimsicalism 579 days ago

yeah essentially this

Here is some additional journey apart from the rig. https://sabareesh.com/posts/llm-intro/

layer8 579 days ago

How long does training a 1B or 500M model take approximately on the 4-GPU setup? Or does that dramatically depend on the training data? I didn’t see that info on your pages.

Roughly it takes 7 days to train on 100B tokens on 500M model

paxys 579 days ago

And where you get the training data from.

Start with FineWebEdu