by dauertewigkeit
933 days ago
Not sure how they do it specifically for LLMs, but you can do what is called model or tensor parallelism where you can split a layer over multiple GPUs or even nodes.
If you look under the hood, it's the same distributed matrix multiplication stuff with MPI, as far as I know. I think DeepSpeed has bespoke transformer kernels that handle this stuff specifically.
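
To make the "split a layer over multiple GPUs" idea concrete, here is a rough single-process sketch in NumPy. It only simulates the devices as slices of one weight matrix, so there is no real GPU or network communication, and the function names (`column_parallel_linear`, `row_parallel_linear`) and the `n_devices` parameter are made up for illustration. It is not how DeepSpeed or any particular framework implements it.

```python
import numpy as np


def column_parallel_linear(x, w, n_devices):
    """Split the weight matrix by columns; each simulated device computes a slice of the output."""
    w_shards = np.array_split(w, n_devices, axis=1)
    # Each "device" multiplies the full input by its own column shard.
    partials = [x @ shard for shard in w_shards]
    # Stand-in for an all-gather: concatenate the output slices.
    return np.concatenate(partials, axis=1)


def row_parallel_linear(x, w, n_devices):
    """Split the weight matrix by rows; each simulated device computes a partial sum of the output."""
    x_shards = np.array_split(x, n_devices, axis=1)
    w_shards = np.array_split(w, n_devices, axis=0)
    # Each "device" sees only its slice of the input features and the matching weight rows.
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
    # Stand-in for an all-reduce: sum the partial results elementwise.
    return sum(partials)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4, 8 input features
w = rng.standard_normal((8, 6))   # layer with 6 output features

reference = x @ w                 # what a single device would compute
col_out = column_parallel_linear(x, w, n_devices=2)
row_out = row_parallel_linear(x, w, n_devices=3)

print(np.allclose(col_out, reference), np.allclose(row_out, reference))
```

In a real multi-GPU setup the concatenate and the sum would be collective operations (all-gather and all-reduce) over the interconnect, which is where the distributed matrix multiplication machinery comes in.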