by intorio 2662 days ago
At least for a commodity chip like Broadcom Tomahawk (100G), the latency is 500ns with L3 enabled, and 300ns if only L2 is enabled. Compared to the Mellanox SB7700 at 90ns, Ethernet has some catching up to do if latency is the end goal.

Ethernet tooling for HPC has a ways to go, but I suspect it will be more competitive in the future, especially if specialty fabric vendors cut down on R&D.

Clos fabric designs seem to be winning the war these days, which I think favors Ethernet in the long run. Better flow distribution on aggregate links and now-widespread support for MC-LAG mean you can build a really wide Clos network with L2 only.
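
To see how those per-switch numbers add up across a fabric, here is a rough sketch. It assumes a two-tier leaf-spine path (leaf, spine, leaf, so three switch hops) and counts switch latency only; cable, NIC, and serialization time are ignored. The hop count and the names in the code are my own illustrative choices, and the latencies are just the figures quoted above.

```python
# Per-switch latencies quoted in the comment above, in nanoseconds.
PER_HOP_NS = {
    "Tomahawk, L3 enabled": 500,
    "Tomahawk, L2 only": 300,
    "Mellanox SB7700": 90,
}


def fabric_switch_latency_ns(per_hop_ns: int, switch_hops: int = 3) -> int:
    """Total switch latency along a path.

    switch_hops=3 assumes a leaf -> spine -> leaf path (an assumption,
    not something stated in the thread).
    """
    return per_hop_ns * switch_hops


if __name__ == "__main__":
    for name, per_hop in PER_HOP_NS.items():
        total = fabric_switch_latency_ns(per_hop)
        print(f"{name:<22} {per_hop:>4} ns/hop -> {total:>5} ns over 3 hops")
```

Under those assumptions the gap grows from a few hundred nanoseconds per switch to roughly 1500ns versus 270ns end to end.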

2 comments

You're comparing apples and oranges. Mellanox has Ethernet switches with 300ns L3 latencies -- far lower than their Broadcom counterparts. So it's not an Ethernet limitation, but a Broadcom limitation.
Do keep in mind that having 300ns L3 latencies comes with its own set of problems. Even at 10Gbit, 300ns is not enough time for a full-size packet to make it across the link. Plus they also have some 90ns latency products.

That means that these switches, while fast, cannot check packets for correctness before forwarding them (they don't have the full packet yet). It means they will have "aborted" packets, and that in some important ways these networks have the problems of the "half-duplex" networks of old.
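
To put numbers on the point above, here is a rough sketch of serialization delay, i.e. how long it takes just to clock a frame onto the wire at a given line rate. The frame sizes (64 and 1500 bytes) and the function name are my own illustrative choices, and preamble, inter-frame gap, and propagation delay are ignored.

```python
def serialization_delay_ns(frame_bytes: int, link_gbps: float) -> float:
    """Time to clock one frame onto the wire, in nanoseconds.

    Ignores preamble, inter-frame gap, and propagation delay (assumption).
    """
    bits = frame_bytes * 8
    # 1 Gbit/s is exactly 1 bit per nanosecond, so ns = bits / Gbps.
    return bits / link_gbps


if __name__ == "__main__":
    for gbps in (10, 100):
        for size in (64, 1500):
            delay = serialization_delay_ns(size, gbps)
            print(f"{size:>5} B at {gbps:>3} Gbit/s: {delay:8.1f} ns")
```

A 1500-byte frame takes 1200ns to arrive at 10Gbit and 120ns at 100Gbit, so a switch quoting 300ns (or 90ns) has to start forwarding before the whole frame is in, which is what cut-through means.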

Broadcom focuses on features for packet transmission. That means these Mellanox switches are pretty much restricted to situations where you want to have a set of servers on a single network segment and nothing else (not even an upstream connection). If that's exactly what you need, great. But mostly you're going to need more.

If you have so many CRC errors that cut-through bothers you, you might want to investigate why your cabling is so damaged.

Your information may be old; Mellanox has pretty much the same feature set as Broadcom now.

With Mellanox Ethernet switches, you lose some other features in exchange for lower latency, e.g. you can only break out 16 ports to 4x25G on a 100G switch (64 25G ports).
And you also get some benefits: enough internal bandwidth to allow every input port to cut through at line rate, and a single buffer to prevent starvation (the Tomahawk chip has 4 groups of ports, each with a separate buffer).

Also, you can run Cumulus on the switch, which is pretty awesome.

Clos fabric topology is the most common deployment with IB, so I'm not sure how that plays into the hands of Ethernet?

But yes, it seems EVPN + VXLAN is the way the industry is going nowadays to build Ethernet Clos fabrics, whereas TRILL & SPB seem more or less dead, for some reason.

Everybody is saying Ethernet is simpler to manage than IB, but IME at least for HPC the opposite is true. IB is more or less plug and play: you get RDMA, multipathing etc. all right out of the box. Whereas if you want to set up an equivalent thing with Ethernet, you have to set up DCB, RoCEv2, EVPN+VXLAN+BGP (or something equivalent).