by
Jabrov
42 days ago
Yes, multiple GPUs absolutely help with inference, even for a single model instance. Some models are simply too big to fit on the largest available GPU.
Check out tensor parallelism.
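For anyone unfamiliar with the idea, here is a minimal sketch of column-wise tensor parallelism in plain Python (no GPU, no framework). The helper names `split_columns` and `tensor_parallel_matmul` are made up for illustration, and each "device" is just a list slice. It only shows that splitting a weight matrix by columns, multiplying each shard separately, and concatenating the results gives the same output as the unsplit multiply.

```python
# Toy column-wise tensor parallelism: each "device" holds a slice of the
# weight matrix columns, so no single device needs the whole matrix.

def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split w into `parts` column shards of equal width (hypothetical helper)."""
    cols = len(w[0])
    assert cols % parts == 0, "columns must divide evenly across devices"
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    """Compute x @ w shard by shard, then concatenate the partial outputs."""
    shards = split_columns(w, parts)
    # On real hardware each shard would live on a different GPU and these
    # multiplies would run at the same time.
    partials = [matmul(x, shard) for shard in shards]
    # Stand-in for the all-gather step that moves data over the interconnect.
    return [value for partial in partials for value in partial]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
print(full == sharded)
```

The concatenation step is where the GPUs have to talk to each other, which is why interconnect speed matters so much for this approach.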
zozbot234
42 days ago
Tensor parallelism is not useful on consumer platforms with slow interconnects, unless compute is really low and you prioritize lower latency over throughput. Pipeline parallelism (and potentially expert parallelism) is more workable.
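A rough way to see why, as a back-of-the-envelope sketch: count how often the GPUs have to synchronize during one forward pass. The function names are hypothetical, and the figure of two all-reduces per transformer layer is an assumption (it matches the common Megatron-style split, but other schemes differ). It also ignores message sizes and any overlap of communication with compute.

```python
# Count cross-GPU communication steps per forward pass under two schemes.
# Assumption: tensor parallelism needs 2 all-reduces per transformer layer.

def tensor_parallel_syncs(num_layers, allreduces_per_layer=2):
    """Every layer synchronizes across all GPUs before the next one can start."""
    return num_layers * allreduces_per_layer

def pipeline_parallel_transfers(num_stages):
    """Activations are handed off once at each boundary between stages."""
    return max(num_stages - 1, 0)

num_layers = 32   # illustrative model depth, not a specific model
num_gpus = 2

tp = tensor_parallel_syncs(num_layers)
pp = pipeline_parallel_transfers(num_gpus)
print(f"tensor parallel: {tp} syncs, pipeline parallel: {pp} transfers")
```

With a slow link, each of those synchronization points adds latency, so the scheme with far fewer of them tends to hold up better on consumer hardware.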