by dauertewigkeit 933 days ago
Not sure how they do it specifically for LLMs, but you can do what is called model parallelism or tensor parallelism, where a single layer is split across multiple GPUs or even multiple nodes. If you look under the hood, it's the same distributed matrix multiplication stuff you would do with MPI, as far as I know.
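
To make the layer splitting concrete, here is a rough sketch in plain Python, with lists standing in for GPUs. Everything in it (the function names, the toy matrices, the assumption that the dimensions divide evenly by the shard count) is made up for illustration and is not how any particular framework does it. It shows the two usual ways to split a weight matrix: by columns, where the output slices get concatenated (an all-gather in real life), and by rows, where the partial products get summed (an all-reduce).

```python
def matmul(a, b):
    # Plain triple-loop matrix multiply on lists of lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_columns(m, n_shards):
    # Split a matrix into equal column blocks (assumes it divides evenly).
    width = len(m[0]) // n_shards
    return [[row[s * width:(s + 1) * width] for row in m]
            for s in range(n_shards)]

def split_rows(m, n_shards):
    # Split a matrix into equal row blocks (assumes it divides evenly).
    height = len(m) // n_shards
    return [m[s * height:(s + 1) * height] for s in range(n_shards)]

def column_parallel(x, w, n_shards):
    # Each "device" holds one column block of W and computes its output slice.
    partials = [matmul(x, w_s) for w_s in split_columns(w, n_shards)]
    # Concatenating the slices side by side stands in for an all-gather.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

def row_parallel(x, w, n_shards):
    # Each "device" holds one row block of W and the matching columns of x.
    partials = [matmul(x_s, w_s) for x_s, w_s in
                zip(split_columns(x, n_shards), split_rows(w, n_shards))]
    # Summing the partial products elementwise stands in for an all-reduce.
    return [[sum(p[i][j] for p in partials) for j in range(len(w[0]))]
            for i in range(len(x))]

# Toy activations (2x4) and weights (4x4), split over 2 pretend devices.
x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 0], [1, 2, 1, 1]]
full = matmul(x, w)
print(column_parallel(x, w, 2) == full, row_parallel(x, w, 2) == full)
```

Both splits give the same result as the unsplit multiply. The difference in practice is which communication step you pay for afterwards.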

I think DeepSpeed has bespoke transformer kernels that handle this stuff specifically.