by _just7_ 530 days ago
I would be much more interested in a piece on what you can train with this kind of rig, rather than the rig itself.
3 comments

The bottleneck for most model training sizes is VRAM, and since each 4090 has 24 GB VRAM, that's 96 GB VRAM total. The article mentions that it can train LLMs from scratch up to 1 billion hyperparameters, which tracks.
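To see roughly why 1B "tracks", here is a back-of-the-envelope sketch. It assumes mixed-precision Adam at about 16 bytes of state per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 master weights plus the two Adam moments). That figure is my assumption, not something stated in the article, and it ignores activations entirely. The function name is just for illustration.

```python
# Rough estimate of training-state memory, excluding activations.
# ASSUMPTION: mixed-precision Adam, ~16 bytes per parameter:
#   2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master weights, momentum, variance)
BYTES_PER_PARAM = 2 + 2 + 12

GPU_VRAM_GB = 24   # per 4090, from the comment above
NUM_GPUS = 4       # 4 x 24 GB = 96 GB total


def training_state_gb(num_params: float) -> float:
    """Weights + gradients + optimizer state, in GB (activations not included)."""
    return num_params * BYTES_PER_PARAM / 1e9


state_gb = training_state_gb(1e9)
print(f"1B params: ~{state_gb:.0f} GB of training state")
print(f"Fits on one {GPU_VRAM_GB} GB card before activations: {state_gb < GPU_VRAM_GB}")
print(f"Total VRAM across the rig: {GPU_VRAM_GB * NUM_GPUS} GB")
```

Under that assumption a 1B model leaves only a few GB per card for activations if every GPU holds a full copy, so the 1B ceiling is plausible for plain data parallelism.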

Nowadays that's not a lot: a single H100 that you can now rent has 80 GB VRAM, and doesn't have the technical overhead of handling work across GPUs.

You should be able to train/full-fine-tune (i.e. full weight updates, not LoRA) a much larger model with 96GB of VRAM. I generally have been able to do a full fine-tune (which, in terms of memory, is equivalent to training a model from scratch) of 34B parameter models at full bf16 using 8xA100 servers (640GB of VRAM) if I enable gradient checkpointing, meaning a 96GB VRAM box should be able to handle models of up to 5B parameters. Of course if you use LoRA, you should be able to go much larger than this, depending on your rank.
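The 5B figure above is just linear scaling from the 34B / 640GB data point. A quick sketch of that arithmetic follows. The function name is made up for illustration, and assuming memory scales linearly with parameter count is a simplification (activations depend on batch size and sequence length too).

```python
def scaled_model_size_b(ref_params_b: float, ref_vram_gb: float, target_vram_gb: float) -> float:
    """Scale a known (model size, VRAM) data point linearly to a different VRAM budget.

    ASSUMPTION: memory per parameter stays constant, which is only a rough guide.
    """
    return ref_params_b * target_vram_gb / ref_vram_gb


# Data point from the comment: 34B params, full bf16 fine-tune, 8xA100 = 640 GB
estimate_b = scaled_model_size_b(34, 640, 96)
implied_bytes_per_param = 640 / 34  # GB per billion params == bytes per param

print(f"Estimated max model size on 96 GB: ~{estimate_b:.1f}B parameters")
print(f"Implied memory budget: ~{implied_bytes_per_param:.1f} bytes per parameter")
```

This works out to roughly 5B parameters and about 19 bytes per parameter, with the latter including whatever activation memory remains after gradient checkpointing.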
Definitely agree, but part of the reason why I built this was to learn about all the overhead and gotchas.
Is there a reason you used hyperparameters rather than parameters? I was going to politely correct the terminology but you seem to be in AI for some time so either it was a mistype or I am misunderstanding what you are referencing.
I imagine that when you get really deep into model training, it can seem like there are a billion hyperparameters you have to worry about.
It's a force of habit, parameters would be more accurate (almost everyone uses them interchangeably nowadays)
Wait what? Who actually calls trainable params "hyperparameters"? Nobody at OpenAI does, as far as I know.
People who are making quick social media posts while taking a casual walk outside on websites that don't make it easy to edit posts and are not expecting to be nitpicked about it.

Overall, it's something I've seen very often on social media and less technical articles about LLMs. OpenAI would fall into the "almost" category.

It's okay to say that you mistyped or whatever, while taking a casual walk outside on websites that don't make it easy to edit posts and are not expected to be nitpicked about it. Throwing in that everyone uses them interchangeably, however, is just profoundly wrong on every level.

I wasn't nitpicking. It is a HUGE differentiation, and I pointed it out specifically because people pick up on terminology so people who might not know better will go forward and just drop in the more super duper hyperparameter, not realizing that it makes them look like they don't know what they're talking about. As I said in the other post, no one who knows anything uses them interchangeably. It is just completely wrong.

I doubt the VRAM simply adds up. I think that’s a feature reserved for their NVLinked HPC series cards. In fact, without NVLink, I don’t see how you’d connect them together to compute a single task in a performant and efficient way.
It depends on how the parallelism is implemented, e.g. distributed data parallel (DDP) to synchronize gradients: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
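To make the DDP idea concrete, here is a toy pure-Python simulation, not the PyTorch API. Every "GPU" keeps a full copy of the model (a single weight here), computes a gradient on its own shard of the batch, and the gradients are averaged across replicas before each update. All names are hypothetical. The takeaway is that plain data parallelism splits the data, not the model, so VRAM is not pooled.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for the gradient all-reduce: every replica ends up with the mean."""
    return sum(values) / len(values)


def ddp_step(replica_weights: List[float], shards: List[List[Sample]], lr: float) -> List[float]:
    # Each replica computes a gradient on its own shard only.
    grads = [local_gradient(w, shard) for w, shard in zip(replica_weights, shards)]
    # Gradients are synchronized, so every replica applies the same update.
    synced = all_reduce_mean(grads)
    return [w - lr * synced for w in replica_weights]


# Toy data generated from y = 3x, split across 4 simulated GPUs.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
num_gpus = 4
shards = [data[i::num_gpus] for i in range(num_gpus)]

weights = [0.0] * num_gpus  # one full model copy per GPU
for _ in range(200):
    weights = ddp_step(weights, shards, lr=0.01)

print(weights)  # all replicas hold the same weight, close to 3.0
```

Approaches that do split the model or optimizer state across cards exist, but they need more communication between GPUs, which is where the interconnect question comes in.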

It's a rabbit hole I stay away from for pragmatic reasons.

yeah essentially this
Here is some more of the journey beyond the rig itself: https://sabareesh.com/posts/llm-intro/
How long does training a 1B or 500M model take approximately on the 4-GPU setup? Or does that dramatically depend on the training data? I didn’t see that info on your pages.
Roughly, it takes 7 days to train a 500M model on 100B tokens.
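For a sense of scale, those numbers can be turned into throughput with some quick arithmetic. The token count, duration, model size, and 4-GPU setup come from the thread. The compute estimate uses the common "6 x parameters x tokens" rule of thumb for training FLOPs, which is an approximation I am assuming here, not something the author reported.

```python
TOKENS = 100e9   # 100B tokens (from the reply above)
DAYS = 7         # reported wall-clock time
NUM_GPUS = 4     # the 4-GPU setup discussed in the thread
PARAMS = 500e6   # 500M-parameter model

seconds = DAYS * 24 * 3600
tokens_per_sec = TOKENS / seconds
tokens_per_sec_per_gpu = tokens_per_sec / NUM_GPUS

# ASSUMPTION: training compute ~ 6 * N * D FLOPs (rule of thumb, not a measurement)
total_flops = 6 * PARAMS * TOKENS
tflops_per_gpu = total_flops / seconds / NUM_GPUS / 1e12

print(f"Throughput: ~{tokens_per_sec:,.0f} tokens/s overall")
print(f"Per GPU:    ~{tokens_per_sec_per_gpu:,.0f} tokens/s")
print(f"Implied sustained compute: ~{tflops_per_gpu:.0f} TFLOP/s per GPU")
```

That is on the order of 165k tokens per second for the whole rig, or about 41k per card, if the 7-day figure is taken at face value.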
And where do you get the training data from?
Start with FineWebEdu