


	by dauertewigkeit 980 days ago
	Not sure how they do it specifically for LLMs, but you can use what is called model (or tensor) parallelism, where a single layer is split across multiple GPUs or even multiple nodes. Under the hood it's the same distributed matrix multiplication you would do with MPI, as far as I know. I think DeepSpeed has bespoke transformer kernels that handle this specifically.
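
To make the "split a layer over multiple GPUs" idea concrete, here is a minimal single-process sketch in NumPy. It is not how DeepSpeed or any particular framework implements it: the function names (`column_parallel_linear`, `row_parallel_linear`) and the `n_devices` parameter are made up for illustration, each "device" is just a list entry, and the concatenate/sum steps stand in for the all-gather and all-reduce collectives a real multi-GPU setup would presumably use.

```python
import numpy as np

def column_parallel_linear(x, w, n_devices):
    """Split the weight matrix by columns; each 'device' computes a slice of the output."""
    w_shards = np.array_split(w, n_devices, axis=1)
    # Every device sees the full input and owns a subset of the output features.
    partial_outputs = [x @ shard for shard in w_shards]
    # Concatenating the slices stands in for an all-gather across devices.
    return np.concatenate(partial_outputs, axis=-1)

def row_parallel_linear(x, w, n_devices):
    """Split the weight matrix by rows; each 'device' computes a partial sum of the output."""
    x_shards = np.array_split(x, n_devices, axis=-1)
    w_shards = np.array_split(w, n_devices, axis=0)
    # Each device multiplies its slice of the input features by the matching weight rows.
    partial_sums = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
    # Summing the partial results stands in for an all-reduce across devices.
    return np.sum(partial_sums, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4, 8 input features
w = rng.standard_normal((8, 6))   # layer weights: 8 inputs -> 6 outputs

reference = x @ w                 # what a single device would compute
col_out = column_parallel_linear(x, w, n_devices=2)
row_out = row_parallel_linear(x, w, n_devices=2)
```

Both variants give the same result as the unsplit `x @ w`; they differ only in which communication step is needed at the end, which is the part a distributed backend would handle.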