|
|
|
|
|
by riskable
805 days ago
|
|
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imaging stringing 24 Cat 8 cables between servers in a single rack? Wait: How do you even decide where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit ethernet to each other? Also: If you've got that many Cat 8 cables coming out the back of the thing how do you even access it? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of horizontal space in the rack so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense but not help in the density department. I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where. |
|
> The Gaudi 3 accelerators inside of the nodes are connected using the same OSFP links to the outside world as happened with the Gaudi 2 designs, but in this case the doubling of the speed means that Intel has had to add retimers between the Ethernet ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports that come out of the back of the system board. Of the 24 ports on each Gaudi 3, 21 of them are used to make a high-bandwidth all-to-all network linking those Gaudi 3 devices tightly to each other. Like this:
> As you scale, you build a sub-cluster with sixteen of these eight-way Gaudi 3 nodes, with three leaf switches – generally based on the 51.2 Tb/sec “Tomahawk 5” StrataXGS switch ASICs from Broadcom, according to Medina – that have half of their 64 ports running at 800 GB/sec pointing down to the servers and half of their ports pointing up to the spine network. You need three leaf switches to do the trick:
> To get to 4,096 Gaudi 3 accelerators across 512 server nodes, you build 32 sub-clusters and you cross link the 96 leaf switches with a three banks of sixteen spine switches, which will give you three different paths to link any Gaudi 3 to any other Gaudi 3 through two layers of network. Like this:
The cabling works out neatly in the rack configurations they envision. The idea here is to use standard Ethernet instead of proprietary Infiniband (which Nvidia got from acquiring Mellanox). Because each accelerator can reach other accelerators via multiple paths that will (ideally) not be over-utilized, you will be able to perform large operations across them efficiently without needing to get especially optimized about how your software manages communication.