Hacker News
by jfkfif 781 days ago
the problem is multinode runs that communicate through the network
2 comments

Multinode runs don’t communicate through the network in a DGX configuration. NVLink allows for RDMA directly over InfiniBand. No need for a network here.
InfiniBand is a network too…

But even if we set that aside, you’ll still access your data over a network connection, because these are expensive nodes running batch jobs with finite disk space, not personal workstations.

Yeah, of course. Nvidia supports this over Mellanox InfiniBand, external NVLink, and even Mellanox Ethernet, PCI-to-PCI, with no need to involve the CPU. nvidia-docker has a few mods to support this too.
Yes, which matters most for training. A good GPU interconnect can make a real difference when training large models.
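
To put rough numbers on why interconnect matters for training, here is a back-of-the-envelope sketch. It assumes gradients are synchronized with a ring all-reduce, where each GPU sends about 2(N-1)/N times the gradient size per step, and it ignores latency and compute/communication overlap. The function names and the two bandwidth figures are placeholders picked for illustration, not measured specs of any real link.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends during one ring all-reduce of payload_bytes."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        return 0.0  # nothing to exchange with a single GPU
    # Reduce-scatter plus all-gather: 2 * (N - 1) chunks of size payload / N.
    return 2.0 * (num_gpus - 1) / num_gpus * payload_bytes


def allreduce_seconds(payload_bytes: float, num_gpus: int,
                      link_bytes_per_sec: float) -> float:
    """Bandwidth-only time estimate; ignores latency and overlap."""
    sent = ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus)
    return sent / link_bytes_per_sec


if __name__ == "__main__":
    grads = 1e9 * 2  # hypothetical: 1B parameters, 2 bytes each (fp16)
    gpus = 8
    # Hypothetical per-GPU link bandwidths in bytes/s, for comparison only.
    for label, bw in [("slower link", 25e9), ("faster link", 250e9)]:
        t = allreduce_seconds(grads, gpus, bw)
        print(f"{label}: {t * 1000:.1f} ms per gradient sync")
```

With the same model and GPU count, the per-step sync time in this model scales inversely with link bandwidth, which is why the interconnect shows up so directly in multinode training throughput.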