by ftufek 848 days ago
Really depends on the model and the software tricks you're using. With DDP and gradient accumulation, you can reduce the bandwidth bottleneck by quite a bit. We've trained with 4090s running at x4 lanes with very small impact. And running at x4 means you can stuff up to 26-28 GPUs on a single CPU node (say EPYC), keep PCIe latency, and get rid of the networking hassle.
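To illustrate why gradient accumulation eases the bandwidth bottleneck, here is a rough back-of-the-envelope model, not a measurement from this setup. It assumes a ring all-reduce with no overlap between compute and communication (pessimistic, since DDP overlaps the two), and every number in it (gradient size, per-micro-batch compute time, GPU count, and a PCIe 4.0 x4 link at roughly 7.9 GB/s) is a made-up placeholder.

```python
def sync_seconds(grad_bytes, link_bytes_per_s, num_gpus):
    """Time for one gradient sync, assuming a ring all-reduce.

    A ring all-reduce sends about 2 * (N - 1) / N times the gradient
    size over each GPU's link.
    """
    traffic = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return traffic / link_bytes_per_s


def comm_fraction(compute_s_per_microbatch, accum_steps,
                  grad_bytes, link_bytes_per_s, num_gpus):
    """Fraction of one optimizer step spent syncing gradients.

    Gradients are synced once per optimizer step, so more accumulation
    steps spread the same sync cost over more compute.
    """
    sync_s = sync_seconds(grad_bytes, link_bytes_per_s, num_gpus)
    compute_s = compute_s_per_microbatch * accum_steps
    return sync_s / (compute_s + sync_s)


# Hypothetical placeholders, not figures from the thread.
GRAD_BYTES = 1e9          # 1 GB of gradients per sync
LINK_BYTES_PER_S = 7.9e9  # assumed PCIe 4.0 x4 throughput
COMPUTE_S = 0.5           # assumed compute time per micro-batch
NUM_GPUS = 8

for accum in (1, 4, 16):
    frac = comm_fraction(COMPUTE_S, accum, GRAD_BYTES,
                         LINK_BYTES_PER_S, NUM_GPUS)
    print(f"accum_steps={accum:2d}  sync share of step={frac:.1%}")
```

In PyTorch, DDP's `no_sync()` context manager is the usual way to skip the all-reduce on the non-final micro-batches.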

Interesting, I would expect the impact to be noticeable at x4! And yeah, it heavily depends on the model, the sharding method, and model vs. data parallelism. I'm hitting peak bandwidth because of a very wide, shallow model that is split across the GPUs (model parallel) with CPU optimizer offload, so it's the worst-case scenario there.

But it does kind of validate Nvidia's choice to remove NVLink. How useful would it really be if x4 PCIe gets reasonably decent perf? Unless your inner dim is massive or something, you should be fine.

Do you have any pictures and/or documentation of that setup, power draw and performance? It sounds pretty interesting!
Never got around to writing any public docs. It's essentially a bunch of GPUs on custom aluminum extrusion frames sitting in a server rack, connected to a ROMED8-2T motherboard through PCIe splitters.

Power is limited to 240 W, with negligible performance loss while roughly halving energy usage. It uses three 20 A circuits.

Performance can range anywhere from 2x 4090 = 1x A100 to 4x 4090 = 1x A100, depending on the model, etc.

It's great value for the money, and very easy to resell as well.
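As a rough check on how PCIe splitters get to the 26-28 GPU range mentioned earlier, here is a small lane-budget sketch. It assumes a board with seven x16 slots, each split into four x4 links. The slot count and the number of reserved links are assumptions for illustration, not details confirmed in this thread.

```python
LANES_PER_X16_SLOT = 16


def max_gpus(num_x16_slots, lanes_per_gpu=4, reserved_links=0):
    """GPUs that fit if every x16 slot is split into equal-width links.

    reserved_links is a hypothetical number of links kept free for
    other devices (for example storage or a network card).
    """
    links_per_slot = LANES_PER_X16_SLOT // lanes_per_gpu
    return num_x16_slots * links_per_slot - reserved_links


# Assumed layout: seven x16 slots, x4 per GPU.
print(max_gpus(7))                    # every link used for a GPU
print(max_gpus(7, reserved_links=2))  # two links kept for other devices
print(max_gpus(7, lanes_per_gpu=16))  # no splitting, one GPU per slot
```

Under these assumptions, the gap between splitting and not splitting is what makes a single-node build of this size possible.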

Very nice!

240W?

3 x 20A = 6600W?

I meant each card is limited to 240 W instead of the usual 450 W. Also, it's more like 4 circuits after all, because the main CPU/motherboard/2 GPUs are on a 15 A circuit too.
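The arithmetic behind the per-card limit can be sketched as follows. The GPU count of 26 is an assumption taken from the low end of the range mentioned at the top of the thread, not a confirmed size for this rig, and the 110 V figure simply matches the 3 x 20 A = 6600 W estimate above.

```python
def gpu_power_watts(num_gpus, watts_per_gpu):
    """Total GPU power draw if every card runs at its power limit."""
    return num_gpus * watts_per_gpu


def circuit_capacity_watts(num_circuits, amps, volts):
    """Nominal capacity of several identical circuits."""
    return num_circuits * amps * volts


NUM_GPUS = 26  # assumption, not stated for this rig
VOLTS = 110    # assumption, matches the 6600 W estimate above

capped = gpu_power_watts(NUM_GPUS, 240)  # power-limited cards
stock = gpu_power_watts(NUM_GPUS, 450)   # cards at the usual limit
capacity = circuit_capacity_watts(3, 20, VOLTS)

print(f"capped: {capped} W, stock: {stock} W, 3 x 20 A: {capacity} W")
print("capped fits:", capped <= capacity)
print("stock fits:", stock <= capacity)
```

This ignores CPU, motherboard, and PSU losses, and any headroom kept for continuous loads, so treat it as an upper bound on what the circuits can carry.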
Ah! OK, thank you, now I get it. That's a very nice rig you have there. So, at a guess, you didn't care as much about peak compute as long as whatever you are doing fits in GPU memory, and this is your way of collecting that much memory in a single machine while still having reasonable interconnect speeds between GPUs?
Yeah, it's really just trying to get as much compute as possible, as cheaply as possible, interconnected in a reasonably fast way with low latency. Slow networking would be a bottleneck, and expensive high-end networking would defeat the purpose of staying cheap.
240W per card probably
Indeed.