by ipieter 377 days ago
Distributing inference per layer, instead of splitting each layer across GPUs, is indeed another approach, called pipeline parallelism. However, for any single batch only one GPU is computing at a time, so inference on that batch is slower. In addition, orchestrating the overlap (starting the next batch on GPU #0 while GPU #1 starts on the current one) is quite tricky. For these reasons, tensor parallelism as I described is way more common in LLM inference.
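To make the "only 1 GPU at a time" point concrete, here is a minimal toy sketch of a pipeline schedule. It assumes every stage takes exactly one time step per batch and ignores communication cost, which real systems cannot do. The names `pipeline_schedule` and `utilization` are made up for illustration and do not come from any library.

```python
def pipeline_schedule(num_stages, num_batches):
    """Toy pipeline-parallel schedule.

    Returns schedule[t][g]: the batch index running on GPU g at time
    step t, or None if that GPU is idle. Assumes each stage costs one
    time step per batch (a simplification).
    """
    total_steps = num_stages + num_batches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for g in range(num_stages):
            b = t - g  # batch b reaches stage g at time step b + g
            row.append(b if 0 <= b < num_batches else None)
        schedule.append(row)
    return schedule


def utilization(schedule):
    """Fraction of (time step, GPU) slots that are doing work."""
    busy = sum(slot is not None for row in schedule for slot in row)
    total = sum(len(row) for row in schedule)
    return busy / total


if __name__ == "__main__":
    # One batch on two GPUs: each GPU sits idle half the time.
    print(pipeline_schedule(2, 1), utilization(pipeline_schedule(2, 1)))
    # Several batches in flight: idle slots remain only at the start and end.
    print(utilization(pipeline_schedule(2, 4)))
```

In this toy model utilization is `num_batches / (num_stages + num_batches - 1)`, so a single batch keeps each GPU busy only `1 / num_stages` of the time, and keeping several batches in flight is exactly the orchestration problem mentioned above.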

In what software? llama.cpp and others divide things by layers.