|
|
|
|
|
by bick_nyers
587 days ago
|
|
Generally speaking this works well, pending your definition of node and the interconnect between them. If by node you mean GPU, and you have multiple of them on the same system (interconnect is PCIE, doesn't need to be full speed however for inference), you're good. If you mean multiple computers connected by 1 Gigabit Ethernet? More challenging. When splitting models layer by layer, users in r/LocalLLaMA have reported good results with as low as PCIE 3.0 x4 as the interconnect (4GB/s). For tensor parallelism, the interconnect requirements are higher but the upside can be faster speeds in accordance to number of GPUs split across (whereas layer by layer operated like a pipeline, so isn't necessarily faster than what a single GPU can provide, even if splitting across 8 GPUs). |
|
Once you have something with 192 GB it gets interesting. You could probably have 7 at FP8 per GPU. At FP16 it probably only would fit 3 per card, requiring 9 again.
I'd say for the current memory layout of cards they missed a little bit the sweet spot. With slightly smaller models or one expert less one should be able to run it on 8 H100s at FP8 or 2 B100s at FP8 or even on 4 B100s at FP16 if I calculated correctly.