The memory and bandwidth numbers are mind blowing. Going to be very hard to catch Nvidia. It’s as if competitors are going through the motions for participation prizes.
AMD has been shipping 128 lanes of PCIe 5.0 on chip. That's 0.5TBps. Getting up to 0.9TBps isn't that crazy, but having a big enough fabric & switches to attach to is a huge feat.
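For anyone who wants to check the 0.5TBps figure, here's a rough back-of-the-envelope sketch. It assumes the PCIe 5.0 raw rate of 32 GT/s per lane with 128b/130b encoding, counts one direction only, and ignores packet/protocol overhead. The function name is just made up for illustration.

```python
# Rough PCIe 5.0 bandwidth estimate (one direction, raw line rate only).
PCIE5_TRANSFERS_PER_S = 32e9      # 32 GT/s per lane
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line encoding


def pcie5_gbytes_per_s(lanes: int) -> float:
    """Approximate one-direction bandwidth in GB/s for `lanes` PCIe 5.0 lanes."""
    bits_per_s = lanes * PCIE5_TRANSFERS_PER_S * ENCODING_EFFICIENCY
    return bits_per_s / 8 / 1e9


if __name__ == "__main__":
    print(f"x16:  {pcie5_gbytes_per_s(16):.0f} GB/s")
    print(f"x128: {pcie5_gbytes_per_s(128):.0f} GB/s")
```

So 128 lanes lands at roughly 500 GB/s each way, which is where the 0.5TBps comes from.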
I have hope though. CXL switching is going to give the whole industry a very fresh look at interconnect fabrics, as a simpler-to-manage, faster, more direct alternative to PCIe. Should be good.
Personally I worry it's flogging a dead horse and has too many constraints, but Ethernet could be rumbling into action again too. The hyperscalers & others created a new Linux Foundation group, the "Ultra Ethernet Consortium", to scale up much faster. Still, even at 1Tbps per link, you'd need a bunch of those Ultra Ethernet links (about 7x) to match NVLink's 0.9TBps GPU interconnect. More radical breaks with Ethernet are needed than line-speed bumps, things that make switches easier to scale out big, if this realm of tech is to be a good systems fabric. https://www.linuxfoundation.org/press/announcing-ultra-ether...
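The link-count math is just a bytes-vs-bits conversion. A quick sketch, assuming a hypothetical 1 Tbps Ethernet link and taking the 0.9TBps NVLink number at face value (no overhead on either side; the helper name is made up):

```python
import math


def ethernet_links_needed(target_tbytes_per_s: float, link_tbits_per_s: float) -> int:
    """Whole Ethernet links needed to match a target given in terabytes per second."""
    target_tbits_per_s = target_tbytes_per_s * 8  # bytes -> bits
    return math.ceil(target_tbits_per_s / link_tbits_per_s)


if __name__ == "__main__":
    ratio = 0.9 * 8 / 1.0
    print(f"raw ratio: {ratio:.1f}x")
    print(f"whole links: {ethernet_links_needed(0.9, 1.0)}")
```

The raw ratio is a bit over 7x, so in practice you'd round up to 8 whole links per GPU.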
One thing about the DGX GH200 architecture that is super interesting to me is that it has inverted the connectivity relationship. Typically a system would have the NIC & GPU hanging off the processor bus, and interconnect traffic would go over that bus (maybe optimizing with p2p-dma to skip going through main memory, if it's fancy). But here? GPUs have a 0.9TBps connection to the NVSwitch. If the CPU wants to talk to the cluster, it uses NVLink C2C to send the data to the GPU, which then uses its NVLink connection to the NVSwitch to send it out. Interesting reversal, interesting flourish, and gee it sure makes sense to me; the GPU is the thing!
Also, past 256 GPUs, there are BlueField-3 devices for Ethernet or InfiniBand connectivity on DGX nodes, which is a good but also pretty boring/standard SmartNIC-based scale-out strategy.
Gaudi2 was competitive with the A100 on paper but was borderline vaporware.
Agree for now, but how long do we think this will last?
There really hasn’t been that great of a financial incentive to compete on DL. Nvidia themselves only recently made this a major priority.
However, now that heaps of money are being thrown at massive training runs, I expect we’ll see more competition popping up, particularly if Intel pulls off IFS and catches up on the next node, increasing availability.