| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by ipieter 424 days ago
	Distributing inference per layer, instead of splitting each layer across gpus, is indeed another approach, called pipeline parallelism. However, per batch there is less compute (only 1 gpu at a time), so inference is slower. In addition, the orchestration of starting the next batch on gpu #0 while gpu #1 starts is quite tricky. For this reason, tensor parallelism as I described is way more common in LLM inference.

1 comments

In what software? llama.cpp and others divide things by layers.