Hacker News
by HPsquared 521 days ago
The models are too large to fit in a desktop GPU's VRAM. Progress would require either smaller models (MoE might help here? not sure) or more VRAM. For example, training a 70-billion-parameter model would require at least 140GB of VRAM in each system (roughly 2 bytes per parameter at 16-bit precision, for the weights alone), whereas a large desktop GPU (4090) has only 24GB.
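The 140GB figure is consistent with storing each parameter in 2 bytes (16-bit weights). A back-of-the-envelope check; the function name and the 2-byte default are my assumptions, not something the comment spells out:

```python
import math

def weights_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

needed_gb = weights_memory_gb(70_000_000_000)  # 140.0
# Number of 24GB cards needed just to hold the weights, ignoring everything else.
gpus_for_weights = math.ceil(needed_gb / 24)   # 6
print(needed_gb, gpus_for_weights)
```

Gradients, optimizer state and activations come on top of this, so real training needs more than the weights-only number.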

You need enough memory to hold the unquantized model for training, then stream the training data through. That part is what is done in parallel, farming out different pieces of the training data to each machine.
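A toy sketch of that data-parallel pattern, assuming a one-parameter model y = w * x with a squared-error loss (both chosen purely for illustration): every worker holds the full model, sees only its own shard of the data, and the per-worker gradients are averaged before the update.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_step(w, shards, lr=0.01):
    """One update step: each shard stands in for one machine's data."""
    # Every worker computes a gradient using its own full copy of w.
    local_grads = [grad_mse(w, xs, ys) for xs, ys in shards]
    # Stand-in for the all-reduce that averages gradients across machines.
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad

shards = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
w_new = data_parallel_step(0.0, shards)
print(w_new)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why the data can be farmed out. The model itself is not split, so each machine still needs room for all of it.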


Data-parallel training is not the only approach. Sometimes the model itself needs to be distributed across multiple GPUs.

https://www.microsoft.com/en-us/research/blog/zero-deepspeed...

The communications overhead of doing this over the internet might be unworkable though.
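For a rough sense of what splitting the model buys you, here is a sketch that assumes the weights are divided evenly across devices. The even split and the function name are my simplifications for illustration, not a description of what the linked DeepSpeed post does:

```python
def per_gpu_weights_gb(total_weights_gb, num_gpus):
    """Weights held by each GPU if the model is split evenly (assumed)."""
    return total_weights_gb / num_gpus

TOTAL_GB = 140   # 70B parameters at 2 bytes each, from the parent comment
VRAM_GB = 24     # a 4090, from the parent comment

for n in (1, 4, 8):
    shard = per_gpu_weights_gb(TOTAL_GB, n)
    print(n, shard, "fits" if shard <= VRAM_GB else "does not fit")
```

The catch is the one already mentioned: every shard has to exchange data with the others during each step, so the saving in memory is paid for in communication.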

Or if the internet became significantly faster, e.g. through fiber connections.
A single GPU has memory bandwidth of around 1000 GB/s ... that's a lot of fiber! (EDIT: although the PCIe interconnect isn't as fast, of course. NVLink is pretty fast though, which is the sort of thing you'd be using in a large system.)
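To make the gap concrete, a quick sketch of how long it takes to move 140GB at different rates. The 1000 GB/s figure is from the comment above; the 1 Gbit/s home fiber line (0.125 GB/s) is my assumption:

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Time to move size_gb at a sustained bandwidth, ignoring latency."""
    return size_gb / bandwidth_gb_per_s

SIZE_GB = 140            # 70B parameters at 2 bytes each
GPU_MEMORY_GB_S = 1000   # on-card memory bandwidth, from the comment
FIBER_GB_S = 1 / 8       # assumed 1 Gbit/s line = 0.125 GB/s

print(transfer_seconds(SIZE_GB, GPU_MEMORY_GB_S))  # 0.14 s
print(transfer_seconds(SIZE_GB, FIBER_GB_S))       # 1120.0 s, close to 19 minutes
```

Under those assumptions the fiber link is 8000 times slower than the GPU's own memory.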
Latency still matters a lot…
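A simple cost model shows why: if a training step needs many small synchronizations, each one pays the full latency no matter how fast the pipe is. All the numbers below (message count, message size, latencies, bandwidth) are made up for illustration:

```python
def sync_time_seconds(num_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Total time if messages are sent one after another (no overlap assumed)."""
    per_message = latency_s + bytes_per_message / bandwidth_bytes_per_s
    return num_messages * per_message

# Same hypothetical bandwidth (1 GB/s) in both cases; only the latency differs.
internet = sync_time_seconds(1000, 1_000_000, 0.05, 1e9)   # assumed 50 ms per message
local = sync_time_seconds(1000, 1_000_000, 5e-6, 1e9)      # assumed 5 microseconds
print(internet, local)
```

With identical bandwidth, the high-latency link comes out around 50 times slower in this toy setup, so faster fiber alone would not close the gap.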