Hacker News
by riskable 805 days ago
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator

WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imagine stringing 24 Cat 8 cables between servers in a single rack? Wait: How do you even decide where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit Ethernet to each other?

Also: If you've got that many Cat 8 cables coming out the back of the thing how do you even access it? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of horizontal space in the rack so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense but not help in the density department.

I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.

7 comments

See https://www.nextplatform.com/2024/04/09/with-gaudi-3-intel-c... for more details. Here are the relevant bits, although you should visit the article to see the networking diagrams:

> The Gaudi 3 accelerators inside of the nodes are connected using the same OSFP links to the outside world as happened with the Gaudi 2 designs, but in this case the doubling of the speed means that Intel has had to add retimers between the Ethernet ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports that come out of the back of the system board. Of the 24 ports on each Gaudi 3, 21 of them are used to make a high-bandwidth all-to-all network linking those Gaudi 3 devices tightly to each other. Like this:

> As you scale, you build a sub-cluster with sixteen of these eight-way Gaudi 3 nodes, with three leaf switches – generally based on the 51.2 Tb/sec “Tomahawk 5” StrataXGS switch ASICs from Broadcom, according to Medina – that have half of their 64 ports running at 800 Gb/sec pointing down to the servers and half of their ports pointing up to the spine network. You need three leaf switches to do the trick:

> To get to 4,096 Gaudi 3 accelerators across 512 server nodes, you build 32 sub-clusters and you cross link the 96 leaf switches with three banks of sixteen spine switches, which will give you three different paths to link any Gaudi 3 to any other Gaudi 3 through two layers of network. Like this:

The cabling works out neatly in the rack configurations they envision. The idea here is to use standard Ethernet instead of proprietary Infiniband (which Nvidia got from acquiring Mellanox). Because each accelerator can reach other accelerators via multiple paths that will (ideally) not be over-utilized, you will be able to perform large operations across them efficiently without needing to get especially optimized about how your software manages communication.
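The port counts in the quoted passages can be sanity-checked with a few lines of arithmetic. Here is a rough sketch in Python that uses only the figures quoted above. The variable names are mine, and two things are assumptions rather than statements from the article: that the 21 internal ports are split evenly across the 7 peers in a node, and that each bank of spine switches serves one leaf switch per sub-cluster.

```python
# Back-of-the-envelope check of the Gaudi 3 topology figures quoted above.
PORTS_PER_ACCEL = 24        # 200 GbE ports per Gaudi 3
PORT_GBPS = 200
ACCELS_PER_NODE = 8         # "eight-way" nodes
INTERNAL_PORTS = 21         # ports used for the in-node all-to-all
OSFP_GBPS = 800
NODES_PER_SUBCLUSTER = 16
LEAVES_PER_SUBCLUSTER = 3
LEAF_PORTS = 64
SUBCLUSTERS = 32
SPINE_BANKS = 3
SPINES_PER_BANK = 16

# Inside one node (assumption: internal ports are split evenly across peers).
peers = ACCELS_PER_NODE - 1
links_per_peer = INTERNAL_PORTS // peers
external_ports = PORTS_PER_ACCEL - INTERNAL_PORTS
node_external_gbps = ACCELS_PER_NODE * external_ports * PORT_GBPS
osfp_per_node = node_external_gbps // OSFP_GBPS

# One sub-cluster: half of each leaf switch's ports point down to the servers.
leaf_down_ports = LEAF_PORTS // 2
osfp_per_subcluster = NODES_PER_SUBCLUSTER * osfp_per_node
down_ports_needed_per_leaf = osfp_per_subcluster // LEAVES_PER_SUBCLUSTER

# Whole cluster.
total_nodes = SUBCLUSTERS * NODES_PER_SUBCLUSTER
total_accels = total_nodes * ACCELS_PER_NODE
total_leaves = SUBCLUSTERS * LEAVES_PER_SUBCLUSTER
total_spines = SPINE_BANKS * SPINES_PER_BANK

# Assumption: each spine bank serves one leaf per sub-cluster.
leaves_per_bank = total_leaves // SPINE_BANKS
uplinks_per_bank = leaves_per_bank * (LEAF_PORTS - leaf_down_ports)
ports_per_spine = uplinks_per_bank // SPINES_PER_BANK

print("links per peer inside a node:", links_per_peer)
print("external ports per accelerator:", external_ports)
print("800G OSFP ports per node:", osfp_per_node)
print("down ports needed per leaf:", down_ports_needed_per_leaf, "of", leaf_down_ports)
print("nodes / accelerators:", total_nodes, "/", total_accels)
print("leaf / spine switches:", total_leaves, "/", total_spines)
print("ports used per spine switch:", ports_per_spine)
```

Under those assumptions the numbers line up with the quote: the 3 leftover ports per accelerator add up to exactly six 800 Gb/sec OSFP ports per node, and sixteen nodes fill the 32 downward-facing ports on each of three leaf switches.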

The PCI-e HL-338 version also lists 24 200GbE RDMA NICs in a dual-slot configuration. How would they be connected?
They may go to the top of the card where you can use an SLI-like bridge to connect multiple cards.
I've heard Infiniband is incredibly annoying to procure, along with some other aspects of it, so lots of folks are very happy to get RoCE (Ethernet) working instead, even if it is a bit cumbersome.
"RoCE"? Woah, I had to Google that.

https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet

    > RDMA over Converged Ethernet (RoCE) or InfiniBand over Ethernet (IBoE) is a network protocol which allows remote direct memory access (RDMA) over an Ethernet network. It does this by encapsulating an InfiniBand (IB) transport packet over Ethernet.
Sounds very cool.
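To make "encapsulating an InfiniBand transport packet over Ethernet" a bit more concrete, here is a toy sketch in Python of the original (v1) framing, where the IB packet rides directly behind an Ethernet header with EtherType 0x8915. The helper names, the MAC addresses, and the all-zero placeholder payload are made up for illustration. The sketch does not build real InfiniBand headers and omits the Ethernet frame check sequence. RoCE v2 instead carries the IB transport inside UDP/IP (destination port 4791), which is not shown here.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # EtherType used by RoCE v1


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def roce_v1_frame(dst_mac: str, src_mac: str, ib_packet: bytes) -> bytes:
    """Prepend a 14-byte Ethernet header to an opaque InfiniBand packet."""
    header = (
        mac_to_bytes(dst_mac)
        + mac_to_bytes(src_mac)
        + struct.pack("!H", ROCE_V1_ETHERTYPE)  # big-endian 16-bit EtherType
    )
    return header + ib_packet


# Placeholder bytes standing in for the IB headers and payload (hypothetical).
ib_packet = b"\x00" * 52
frame = roce_v1_frame("02:00:00:00:00:02", "02:00:00:00:00:01", ib_packet)

print("frame length:", len(frame))
print("EtherType:", frame[12:14].hex())
```

The point is only that the Ethernet layer treats the InfiniBand packet as an ordinary payload; everything RDMA-specific lives inside it.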
Is there any Infiniband vendor left other than Nvidia?
200 Gb is not going to be using Cat cabling; it will be fiber (or direct-attach copper cable, as noted by dogma1138) with a QSFP interface.
It will most likely use copper QSFP56 cables, since these interfaces are used either for direct attachments within a rack or to an adjacent rack, or to reach the nearest switch.

0.5-1.5/2 m copper cables are easily available and cheap, and 4-8 m (and even longer) is also possible with copper but tends to be more expensive and harder to come by.

Even 800 Gb is possible with copper cables these days, but you’ll end up spending just as much if not more on cabling as on the rest of your kit… https://www.fibermall.com/sale-460634-800g-osfp-acc-3m-flt.h...

Fair point!
For Gaudi2, it looks like 21/24 ports are internal to the server. I highly doubt those have actual individual cables. Most likely they're just carried on PCBs like any other signal.

100GbE is only supported on twinax anyway, so Cat 8 is irrelevant here. The other 3 ports are probably QSFP or something.

Audio folks solved the "which cable goes where" problem ages ago with cable snakes: https://www.seismicaudiospeakers.com/products/24-channel-xlr...

But I'm not sure how big and how expensive a 24-channel Cat 8 snake would be (!).

I wouldn’t think that would be appropriate for Ethernet due to cross talk.
Four-lane and eight-lane twinax cables exist; I think each pair is individually shielded. Beyond that there's fiber.
Those cables definitely exist for Ethernet, and regarding cross talk, that's what shielding is for.

Although not for 200 Gbps; at that rate you either use big twinax DACs or go to fibre.

Rainbow parens, meet rainbow tables.
The amount of power that will use up is massive; they should've gone for fiber instead.
It will be fiber; Ethernet is just the protocol, not the physical interface.
Fiber optics are also extremely power hungry. For short runs people use direct-attach copper cables to avoid having to deal with fiber optics.