by cavisne
590 days ago
For the past few years there has been a weird situation where Google and AWS have had worse GPU offerings than smaller providers like CoreWeave and Lambda Labs. This is because they didn't want to buy into Nvidia's proprietary InfiniBand stack for GPU-to-GPU networking, and instead wanted to make it work on top of their own Ethernet-based (but still pretty proprietary) stacks. The outcome was really bad GPU-to-GPU latency and bandwidth between machines. My understanding is that ConnectX is Nvidia's supported (and probably still very profitable) way for these hyperscalers to use their proprietary networks without buying InfiniBand switches and without paying the latency cost of moving bytes from the GPU to the CPU.

RoCE (RDMA over Converged Ethernet) is essentially IB over Ethernet. All the underlying documentation and settings to put this stuff together are the same. It doesn't require ConnectX NICs, though. We do the same with 8x Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5 based Dell Z9864F switch) for our own 400G cluster.