by jabl 2656 days ago
This is interesting.

Mellanox has apparently been under activist investor pressure to reduce their R&D expenses and pay more dividends. And then there were the rumors that Intel was interested, but apparently Nvidia offered more in the end.

From an HPC perspective I think it's good Nvidia got the deal. Intel is already quite a dominant force in that market, and if they had gotten the deal it wouldn't have surprised me if they had just sunsetted it in favor of their own Omni-Path (which they could then develop at a leisurely pace due to lack of competition).

Though as I have mentioned before, I do wonder about the long-term prospects for Infiniband as a technology. Modern high-end ethernet does many of the same things with RDMA (RoCE), though I believe IB still has a latency advantage. And multipathing with ethernet is weird, seems both Trill and SPB are kind of dead, and most players seem to do multipathing at the L3 level (which might not be good for latency?). And in contrast to ethernet, IB is pretty much a single-player technology nowadays, so is the market big enough to bear the R&D costs to keep developing it?

6 comments

At least for a commodity chip like Broadcom Tomahawk (100G), the latency is 500ns with L3 enabled, and 300ns if only L2 is enabled. Compared to the Mellanox SB7700 at 90ns, ethernet has some catching up to do if latency is the end goal.

Ethernet tooling for HPC has a ways to go, but I suspect in the future it will be more competitive. Especially if specialty fabric vendors cut down on R&D.

CLOS fabric designs seem to be winning the war these days which I think favors Ethernet in the long run. Better flow distribution on aggregate links and now widespread support for MC-LAG means you can build a really wide CLOS network with L2-only.
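
To put a rough number on "really wide", here is a minimal sketch of how many hosts a non-blocking Clos fabric can attach. It assumes identical switches of a given port count (radix) and no oversubscription. The function names are made up for illustration and the formulas are the textbook leaf-spine and fat-tree bounds, not figures from any vendor.

```python
def two_tier_hosts(radix: int) -> int:
    """Max hosts in a non-blocking leaf-spine fabric of identical switches.

    Each leaf uses half its ports for hosts and half for uplinks. Each spine
    port connects to one leaf, so there are at most `radix` leaves.
    """
    return radix * (radix // 2)


def three_tier_hosts(radix: int) -> int:
    """Max hosts in a classic three-tier fat tree: radix^3 / 4."""
    return radix ** 3 // 4


if __name__ == "__main__":
    # 32 ports is an assumed radix (e.g. a 32 x 100G switch).
    for k in (32, 64):
        print(f"radix {k}: two-tier {two_tier_hosts(k)}, "
              f"three-tier {three_tier_hosts(k)}")
```

With 32-port switches that is 512 hosts in two tiers and 8192 in three. Real deployments usually oversubscribe the uplinks, which trades bisection bandwidth for more host ports.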

You're comparing apples and oranges. Mellanox has Ethernet switches with 300ns L3 latencies -- far lower than their broadcom counterparts. So it's not an Ethernet limitation, but a broadcom limitation.
Do keep in mind that having 300ns L3 latencies comes with its own set of problems. Even at 10Gbit, 300ns is not enough to get a full-size packet through an electrical connection. Plus they also have some 90ns latency products.

That means that these switches, while fast, cannot check the packets for correctness (they don't have the full packet), so they will have "aborted" packets. In some important ways these networks have the problems of the "half-duplex" networks of old.
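
The arithmetic behind this can be sketched as follows. It is only a back-of-the-envelope calculation: it counts payload bytes on the wire, ignores preamble and inter-frame gap, and the helper names are invented for this example.

```python
def serialization_ns(frame_bytes: int, link_gbps: float) -> float:
    """Time to clock a whole frame onto the wire, in nanoseconds.

    1 Gbit/s is 1 bit per nanosecond, so bits / Gbit/s gives ns directly.
    """
    return frame_bytes * 8 / link_gbps


def could_be_store_and_forward(latency_ns: float, frame_bytes: int,
                               link_gbps: float) -> bool:
    """A store-and-forward switch must receive the full frame (and its CRC)
    before forwarding, so its latency cannot be below the serialization time.
    """
    return latency_ns >= serialization_ns(frame_bytes, link_gbps)


if __name__ == "__main__":
    for size in (64, 1500):
        for speed in (10, 100):
            t = serialization_ns(size, speed)
            ok = could_be_store_and_forward(300, size, speed)
            print(f"{size} B at {speed}G: {t} ns on the wire, "
                  f"300 ns store-and-forward possible: {ok}")
```

A 1500-byte frame takes 1200 ns to serialize at 10 Gbit/s, so a switch quoting 300 ns must start forwarding before the trailing CRC has arrived, i.e. it is cutting through.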

Broadcom focuses on features for packet transmission. That means these Mellanox switches are pretty much restricted to situations where you want to have a set of servers on a single network segment and nothing else (not even an upstream connection). If that's exactly what you need, great. But mostly you're going to need more.

If you have so many CRC errors that cut-through bothers you, you might want to investigate why your cabling is so damaged.

Your information may be old; Mellanox has pretty much the same feature set as Broadcom now.

With Mellanox ethernet switches, you lose some other features in exchange for lower latency, e.g. you can only break out 16 ports to 4x25G on a 100G switch (64 25G ports).
And you also get some benefits: enough internal bandwidth to allow every input port to cut through at line rate, and a single buffer to prevent starvation (the tomahawk chip has 4 groups of ports, each with a separate buffer).

Also, you can run cumulus on the switch, which is pretty awesome.

CLOS fabric topology is the most common deployment with IB, so I'm not sure how that plays into the hands of ethernet?

But yes, seems EVPN + VXLAN is the way the industry is going nowadays to build eth CLOS fabrics, whereas Trill & SPB seem more or less dead, for some reason.

Everybody is saying ethernet is simpler to manage than IB, but IME at least for HPC the opposite is true. IB is more or less plug and play, you get RDMA, multipathing etc. all right out the box. Whereas if you'd set up an equivalent thing with ethernet, you'd have to set up DCB, RoCEv2, EVPN+VXLAN+BGP (or something equivalent).

While I agree about Intel's market force, they have a much better open source software story than nvidia. Nvidia is as closed as the other big player, Broadcom.
I had similar hopes when Intel acquired Altera, however I think Intel managed to completely botch that acquisition. I now believe Intel is going through a phase where they seem to be struggling to get things out of the door on time (like 10nm, Optane, Xeon Phis, Drones, Nervana, Edison...). At a time when AMD seems to be credibly challenging Intel for the first time in over a decade, any non-x86 efforts will likely end up as a side hobby and not get the attention/investment from Intel management that they deserve. As a result I am glad Mellanox did not end up at Intel.
There is a lot of heavily patent/trade secret encumbered IP in both graphics and compute drivers, making open source extremely difficult. Above that layer I've found Nvidia extremely open.

Intel is far worse to deal with, and additionally engages in anti-competitive architectural wars, preventing other vendors from interfacing with the CPU bus.

As a result we have NVLink, and now we will soon have official IB cards with NVLink ports, and probably ARM cores too.

> There is a lot of heavily patent/trade secret encumbered IP in both graphics and compute drivers, making open source extremely difficult. Above that layer I've found Nvidia extremely open.

> Intel is far worse to deal with

Huh? Nvidia is by far the worst option when it comes to GPUs if you want to run Linux. Both Intel and AMD manage to have excellent open source drivers, while Nvidia's is a proprietary mess that everyone complains about and doesn't work all that great with typical distro update mechanisms.

We have been moving away from IB for our platform (algorithmic trading) since Ethernet now has almost comparable latency and is a lot easier to understand and manage.
Latency really isn't comparable, but it probably isn't an issue for algorithmic trading now. There are still many other closely coupled codes for which it dominates cluster performance.

IB is actually easier to reason about and debug than DCE, but obviously a different community of practice.

Understand, manage, and buy networking equipment for. Infiniband is a thing of the past, especially with the Mellanox VPI adapters that support both Ethernet and Infiniband with a single bit flipped on the adapter.
IDK, IME IB is pretty much plug and play in an HPC setting. Plain ethernet is too, sure, but if you want to do HPC type workloads you'll have to do a lot of configuration and testing to set up DCB, RoCEv2, EVPN+VXLAN+BGP or such.

But I think this is the way the market is going in the longer term.

It is, but if you're a large enough company to be buying millions of dollars in adapters and switches, reading a guide from Mellanox to turn on DCB should be fairly seamless. RoCEv2 is API-compatible with IB for the most part, so there is really no configuration on that layer. The other pieces -- not really sure what you're getting at. Most of those are for going across data centers, which IB won't do anyways. At least Ethernet would give you the option to run RDMA from the east coast to the west coast.
> The other pieces -- not really sure what you're getting at.

What I'm getting at is setting up clusters larger than what you can fit behind a single switch. So you'll want e.g. a CLOS fabric with multipathing (the typical IB setup, FWIW). As Trill and SPB seem pretty dead, it seems the momentum is to do the multipathing at the L3 level, using the aforementioned EVPN+VXLAN+BGP, or something similar.
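
For readers unfamiliar with L3 multipathing, the usual mechanism is equal-cost multipath (ECMP): each switch hashes a flow's 5-tuple and uses the result to pick one of several equal-cost uplinks. The sketch below is only an illustration of that idea. The function name is hypothetical, SHA-256 stands in for whatever hardware hash a real ASIC uses, and the addresses and spine names are made up.

```python
import hashlib


def ecmp_next_hop(src_ip: str, dst_ip: str, proto: int,
                  src_port: int, dst_port: int, next_hops: list) -> str:
    """Pick one of several equal-cost next hops from a flow's 5-tuple.

    All packets of one flow hash to the same value, so they take the same
    path and stay in order. Different flows spread across the uplinks.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    spines = ["spine1", "spine2", "spine3", "spine4"]  # hypothetical uplinks
    # RoCEv2 runs over UDP (protocol 17) with destination port 4791.
    for sport in (49152, 49153, 49154):
        hop = ecmp_next_hop("10.0.0.1", "10.0.1.1", 17, sport, 4791, spines)
        print(sport, "->", hop)
```

Because the choice is per flow rather than per packet, a few large flows can land on the same uplink while others sit idle, which is one reason flow distribution keeps coming up in these comparisons.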

You really don't need EVPN+VXLAN though. (And if you do need it I recommend finding a way to not need it.)
May I ask what’s IB?
infiniband - The high performance interconnect by Mellanox
Infiniband is actually a standard and Mellanox became the go to supplier. However in the early stages there were many more suppliers.
Intel bought one of the IB players and for all I know killed it.
Intel bought Qlogic, made some proprietary enhancements to their IB tech, which they now sell under the omni-path brand.
By Voltaire, the infiniband company acquired by mellanox :)

Their ethernet accelerator VMA stands for "Voltaire Messaging Accelerator".

IIRC Mellanox was always doing Infiniband; they made the hardware and Voltaire wrote the drivers (before people understood open source and created OFED).
Mellanox made the HCAs (aka network cards) and Voltaire made the switches. Mellanox had their own drivers and their own MOFED (their incompatible fork of openfabrics upstream OFED with Mellanox specific enhancements).
We just pulled a Voltaire switch from our data center. They definitely made hardware. Or at least put their name on it.
A similarly interesting fact is that Starboard Value, the activist fund, had tried to get Mellanox to merge with Marvell, another company in its portfolio. But Marvell was rebuffed.

Later Marvell went on to acquire Cavium for 6 billion dollars with the aim of building an infrastructure company. Though from the company's latest earnings release it seems that the deal isn't really a good one.

Mellanox is big in high-end Ethernet equipment as well which is growing faster than the Infiniband business, obviously a key technology for cloud providers.
According to https://www.nextplatform.com/2018/11/02/datacenter-25g-ether... the ethernet part of the business is actually already much bigger than the IB business.
Intel already have the high-performance interconnect they acquired from Cray (Aries) and now sell as "OmniPath". Not sure how that sells head-to-head vs. InfiniBand though. IB obviously has a lot more legacy presence in HPC data centers.

Edit: sorry somehow glossed over your mention of OmniPath, didn't mean to restate what you already said.