They are one of only two InfiniBand vendors, and InfiniBand is quite important for HPC, especially in the Top 500.
They also have a parallel VLIW CPU architecture called TILE/TILE64 that they've been trying to get off the ground for a while now, so that might also play into things.
However, since NVIDIA opened offices in Israel a while ago, they might simply be looking for an acquihire. Mellanox is a fabless chip maker, so it fits that too.
Mellanox has driven IB speeds for more than a decade, limited only by PCIe bandwidth. Since they've had NICs that do both IB and Ethernet, they've been driving the Ethernet market as well. We've been using their 100G adapters since 2015 (when they were first to market by a big margin). Even today, there are only a handful of vendors that can deliver a 100G NIC. I worry that if Mellanox stops driving port speed, we'll see NIC speeds increase more slowly due to the lack of competition (e.g., 400G will take longer).
InfiniBand hasn't increased in speed in a while, while Ethernet has. IB is all but done, since 200 and 400 Gbps Ethernet will be out soon and are, in fact, already supported by Mellanox switches.
From what I understand, the main reason it hasn't increased is PCIe. PCIe is set to double in speed with 4.0 and again with 5.0 once those become more widely available, with maximum speeds of 1.6 and 4.0 TB for IB x16.
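To put rough numbers on the PCIe ceiling, here is a minimal sketch of one-direction x16 slot bandwidth per generation, using the published per-lane rates (8, 16 and 32 GT/s) and 128b/130b encoding. The helper names `pcie_gbps` and `port_fits` are made up for illustration, and protocol overhead is ignored, so real throughput is a bit lower.

```python
# Raw per-lane signalling rate in GT/s for each PCIe generation.
PCIE_GT_PER_LANE = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}

# Generations 3.0 through 5.0 all use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gbps(gen: str, lanes: int = 16) -> float:
    """Approximate one-direction slot bandwidth in Gbit/s.

    Ignores TLP/DLLP protocol overhead, so real throughput is somewhat lower.
    """
    return PCIE_GT_PER_LANE[gen] * lanes * ENCODING_EFFICIENCY


def port_fits(port_gbps: float, gen: str, lanes: int = 16) -> bool:
    """True if a single NIC port at port_gbps can be fed by the slot."""
    return port_gbps <= pcie_gbps(gen, lanes)


if __name__ == "__main__":
    for gen in PCIE_GT_PER_LANE:
        print(f"PCIe {gen} x16: ~{pcie_gbps(gen):.0f} Gbit/s")
```

By this estimate a 100G port fits in a PCIe 3.0 x16 slot, but a 200G port needs 4.0 and a 400G port needs 5.0.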
IB still has lower latency than Ethernet, at least on paper, especially when it comes to RDMA, but I don't know how much of an issue that is for these applications.
But overall I'm not sure how much it matters to Mellanox, since they are also the ones making the high-speed Ethernet switches and host adapters.
TILE64 got squeezed at both ends, with GPUs becoming more capable on one side and CPUs getting lots of cores (Threadripper) on the other. The niche just closed up on them.
Tilera was acquired by EZChip in 2014, and EZChip was in turn acquired by Mellanox. The Tile arch kind of died from lack of attention during that process.
I had a TileGX dev board and ported our product at the time (nearly 10 years ago). It was an ok arch but that’s a tough niche to fight for.
Don't think so, but they do have about 3000 employees, mainly in Israel, and design a wide range of ASICs. I doubt that they got them for InfiniBand and the interconnect business alone.
Oh. I've known about TILE64 since Tilera created it (I spoke to their VP of something or other on the phone once; I worked in telecoms at the time and we were interested in seeing whether their hardware would help accelerate something we worked on. We never went forward with it, though). I hadn't realised Mellanox owned it now. Looking at Wikipedia, Tilera was bought by EZChip, which was bought by Mellanox.
Could be, but their mesh fabric might be useful for some multi-GPU configurations, especially if NVIDIA goes into chiplets. I don't know whether it's better than NVLink, but since NVLink looks to be pretty much PCIe with a lot of the overhead stripped out, it just might be.
Can confirm, running 100G Mellanox and getting ready to move to 200. They are the best game in town for the price point to performance/reliability/support ratio.
I use Mellanox ConnectX-5 Dual QSFP+ 100Gbit Ethernet cards in my OpenStack private cloud at my business. Mellanox has been instrumental in running flash Ceph arrays by pushing the speed envelope beyond what the Intels and Ciscos of the world are doing.
Mellanox has also embraced bringing RDMA to things like Ceph and working with the broader vendor ecosystem, like Red Hat, to use this in production.
I hope Nvidia doesn't taint the good reputation of this company.
They are still the company to go to for InfiniBand, but InfiniBand has lost much of its appeal for anything short of true supercomputing tasks.
Ethernet nowadays can do RDMA, soft guarantees on latency, in-order and reliable delivery at lower costs, and an option to reuse existing L2 networks. Mellanox has squeezed the InfiniBand cow dry.
> Ethernet nowadays can do RDMA, soft guarantees on latency, in-order and reliable delivery at lower costs, and an option to reuse existing L2 networks. Mellanox has squeezed the InfiniBand cow dry.
And who created the Ethernet RDMA protocol (RoCE v1/v2) and sells the RDMA-capable NICs that everyone in the HPC world currently uses?
RDMA is an interesting point because QLogic was doing some stuff, then Cavium bought them, then Marvell bought them. And then there's the Emulex-Avago-Broadcom chain. The entire market is converging into a few major players.
And yet an Ethernet frame, by design, is larger than an InfiniBand frame (think layer 2). When it comes down to node-to-node latency, given perfectly equal silicon, InfiniBand will still be faster.
I think the minimum size of an IB packet with no payload is 26 octets, vs. 64 octets for an Ethernet frame. So sure, a difference of 38 octets, but at, say, 100 Gbit/s, that's only about 3 ns of difference, much much less than the IB vs. Ethernet latency difference. So I think you'll have to look somewhere else for the explanation.
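A quick sanity check of that arithmetic, as a minimal sketch: 38 octets is 304 bits, which at 100 Gbit/s takes roughly 3 ns to serialize. The function names are made up for illustration, and the 26 and 64 octet minimums are simply the figures quoted above.

```python
IB_MIN_OCTETS = 26   # minimum IB packet size quoted above (no payload)
ETH_MIN_OCTETS = 64  # minimum Ethernet frame size quoted above


def serialization_ns(octets: int, link_gbps: float) -> float:
    """Time to clock `octets` onto a link of `link_gbps` Gbit/s, in nanoseconds."""
    bits = octets * 8
    return bits / link_gbps  # bits divided by Gbit/s gives nanoseconds


def min_frame_gap_ns(link_gbps: float) -> float:
    """Extra serialization time of a minimum Ethernet frame over a minimum IB packet."""
    return serialization_ns(ETH_MIN_OCTETS - IB_MIN_OCTETS, link_gbps)


if __name__ == "__main__":
    for speed in (10, 25, 100, 200):
        print(f"{speed:>3} Gbit/s: {min_frame_gap_ns(speed):.2f} ns")
```

Either way, the frame-size gap is a few nanoseconds at most at these speeds, so it cannot account for the latency difference on its own.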
I have no idea what it is, actually. Some ideas that may or may not matter (or might not even be correct):
- IB is a couple of decades younger, so it could benefit from accumulated knowledge of how to design fast protocols. (Not an explanation per se)
- Simpler forwarding. In IB the subnet manager gives out the LIDs that are used for routing within a subnet. They are shorter than an Ethernet MAC (16 vs. 48 bits), so the lookup circuits in the switches can be smaller and faster(?). Also, since the LIDs are assigned by the subnet manager rather than being burned in at the factory, they can be distributed taking the subnet topology into account, allowing switches to use LID Mask Control (LMC) filtering. Similarly, all routes within a subnet are calculated statically, a priori, by the subnet manager (load balancing among multiple paths is only static round robin, not dynamically load-dependent), and don't have to be calculated on the fly by the switches.
- FEC rather than retransmission in case of corruption.
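To illustrate the "simpler forwarding" idea from the list above, here is a rough sketch of a direct-indexed LID table of the kind a subnet manager could precompute, plus the way LMC gives one port several consecutive LIDs. The names `LidForwardingTable` and `lids_for_port` are hypothetical, and the unicast LID range (0x0001 to 0xBFFF) and the alignment of the base LID to 2**LMC are my understanding of the IB spec rather than something stated in this thread.

```python
from typing import List, Optional

UNICAST_LID_LIMIT = 0xC000  # assumption: unicast LIDs are 0x0001..0xBFFF


class LidForwardingTable:
    """Direct-indexed table mapping a 16-bit destination LID to an output port.

    A 16-bit key is small enough to index an array directly, whereas a
    48-bit MAC address usually needs a hash or CAM lookup.
    """

    def __init__(self) -> None:
        self._ports: List[Optional[int]] = [None] * UNICAST_LID_LIMIT

    def set_route(self, lid: int, port: int) -> None:
        """Install a route; in IB this would be done by the subnet manager."""
        if not 0 < lid < UNICAST_LID_LIMIT:
            raise ValueError("not a unicast LID")
        self._ports[lid] = port

    def lookup(self, lid: int) -> Optional[int]:
        """Single array index, no learning or flooding involved."""
        return self._ports[lid]


def lids_for_port(base_lid: int, lmc: int) -> List[int]:
    """With LMC = n, a port answers to 2**n consecutive LIDs from its base LID."""
    if base_lid & ((1 << lmc) - 1):
        raise ValueError("base LID must have its low LMC bits clear")
    return list(range(base_lid, base_lid + (1 << lmc)))


if __name__ == "__main__":
    table = LidForwardingTable()
    # Give one destination four LIDs and spread them over two uplinks.
    for i, lid in enumerate(lids_for_port(0x10, lmc=2)):
        table.set_route(lid, port=1 + i % 2)
    print([table.lookup(lid) for lid in lids_for_port(0x10, 2)])
```

The point of the sketch is only that the switch does a table read per packet, with all path selection decided ahead of time by whoever fills the table.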
Sure, IB simply is a superior fabric for its niche.
For everything else, RDMA on Ethernet buys you the ability to reuse your L2, and this matters way more to people running DC businesses than anything else.