by HPsquared
521 days ago
The models are too large to fit in a desktop GPU's VRAM. Progress would require either smaller models (MoE might help here? not sure) or more VRAM. For example, training a 70 billion parameter model would require at least 140GB of VRAM in each system just to hold the weights at 16 bits (2 bytes) per parameter, whereas a large desktop GPU (4090) has only 24GB. You need enough memory to hold the unquantized model for training, then stream the training data through it. That second part is what gets done in parallel, by farming out different bits of training data to each machine.
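To make the arithmetic concrete, here is a rough sketch in Python. It assumes 2 bytes per parameter (16-bit weights) and counts only the weights, so real training needs more once gradients and optimizer state are added. The function name is just illustrative.

```python
GB = 1e9  # decimal gigabytes

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; gradients and optimizer state come on top."""
    return n_params * bytes_per_param / GB

needed_gb = weight_memory_gb(70e9)  # 70B parameters, 16-bit weights (assumption)
gpu_vram_gb = 24                    # RTX 4090

print(f"weights: {needed_gb:.0f} GB, available: {gpu_vram_gb} GB")
print(f"shortfall: {needed_gb / gpu_vram_gb:.1f}x")
```

So even before training overhead, the weights alone are several times larger than what one 4090 can hold.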
https://www.microsoft.com/en-us/research/blog/zero-deepspeed...
The communications overhead of doing this over the internet might be unworkable though.
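A back-of-envelope sketch of why, in Python. The assumptions are mine: plain data parallelism where every machine exchanges a full set of 16-bit gradients each step, and a 100 Mbit/s home connection. It ignores compression, gradient accumulation, and smarter all-reduce schemes, so treat it as an order-of-magnitude estimate only.

```python
def transfer_seconds(n_params: float, bytes_per_grad: int, link_bits_per_s: float) -> float:
    """Time to push one full copy of the gradients over a single link."""
    total_bits = n_params * bytes_per_grad * 8
    return total_bits / link_bits_per_s

# Hypothetical numbers: 70B parameters, 16-bit gradients, 100 Mbit/s link.
secs = transfer_seconds(70e9, bytes_per_grad=2, link_bits_per_s=100e6)
print(f"{secs:.0f} s per sync, about {secs / 3600:.1f} hours")
```

Under those assumptions a single gradient exchange takes hours, compared with well under a second over the interconnects used inside a datacenter.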