| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by HPsquared 521 days ago
	The models are too large to fit on a desktop GPU's VRAM. Progress would either require smaller models (MoE might help here? not sure) or bigger VRAM. For example training a 70 billion parameter model would require at least 140GB of VRAM in each system, whereas a large desktop GPU (4090) has only 24GB. You need enough memory to run the unquantized model for training, then stream the training data through - that part is what is done in parallel, farming out different bits of training data to each machine.

1 comments

mr_toad 521 days ago

Data parallel training is not the only approach. Sometimes the model itself needs to be distributed across multiple GPU.

https://www.microsoft.com/en-us/research/blog/zero-deepspeed...

The communications overhead of doing this over the internet might be unworkable though.

link

htrp 521 days ago

or if the internet became significantly faster fiber connections

link

HPsquared 521 days ago

A single GPU has memory bandwidth around 1000 GB/s ... that's a lot of fiber! (EDIT: although the PCIE interconnect isn't as fast, of course. NVLink is pretty fast though which is the sort of thing you'd be using in a large system)

link

brookst 521 days ago

Latency still matters a lot…

link