by ChuckMcM 2661 days ago
I did not see that coming.

At NetApp we were an early customer of Mellanox (I told the founder that their name sounded like a poison gas :-)) which Steve Kleiman claimed implemented InfiniBand in anger. It was a good technology for the clustering team. Later, as they grew and diversified into Ethernet switches, we bought a couple of their big core switches at Blekko. And at the current company we use their 40G network adapters to connect to high-speed SDR hardware.

So now they are going to be part of Nvidia.

I get that this helps Nvidia become more data-center-centric, but does it help them build better machine learning architectures? It does seem to be the only system that benefits from custom hardware more than the cost of that hardware. It seems that loosely coupled, shared-nothing clusters are not good machine learning back ends.

2 comments

State-of-the-art deep learning models are becoming larger and larger, and at some point it makes sense to distribute them over multiple GPUs because they would not fit into a single GPU's memory. At the same time, training can be sped up dramatically by increasing the mini-batch size in a synchronized training regime, again requiring multiple GPUs. So the trend is towards "model parallelism" and "data parallelism" at the same time. Once you need more GPUs than you can put on a single PCI Express bus, you need a fast interconnect between servers. InfiniBand seems to be the best solution at this time. Nvidia GPUs can already communicate ridiculously fast with remote GPUs via RDMA if there is an InfiniBand connection. It makes a lot of sense for Nvidia to push in this direction to provide integrated solutions.

Nvidia GPUs already support RDMA directly from GPU to InfiniBand NIC, bypassing the host completely. 100G is current, 200G is in the works. For lots of GPU workloads that don't rely on lots of little kernel launches, having a completely remote GPU is not out of the question.
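
To put rough numbers on those link speeds, here is a back-of-envelope sketch of how long one synchronized gradient exchange might take in data-parallel training. It assumes a ring all-reduce (a common choice, not something stated in this thread), counts only bandwidth, and ignores latency and protocol overhead. The function name, the 1 GB gradient size, and the 8-node cluster are all made-up illustration values.

```python
def ring_allreduce_seconds(payload_bytes, num_nodes, link_gbps):
    """Idealized, bandwidth-only time for one ring all-reduce.

    In a ring all-reduce each node sends (and receives) 2 * (N - 1) / N
    times the payload. Latency and protocol overhead are ignored.
    """
    if num_nodes < 2:
        return 0.0  # nothing to exchange with a single node
    bytes_per_second = link_gbps * 1e9 / 8  # link speed is in gigabits/s
    traffic_bytes = 2 * (num_nodes - 1) / num_nodes * payload_bytes
    return traffic_bytes / bytes_per_second


GRADIENT_BYTES = 1e9  # hypothetical: ~250M float32 parameters
NODES = 8             # hypothetical cluster size

for gbps in (100, 200):
    ms = ring_allreduce_seconds(GRADIENT_BYTES, NODES, gbps) * 1e3
    print(f"{gbps}G link: about {ms:.0f} ms per all-reduce")
```

Under these assumptions the exchange takes about 140 ms at 100G and about 70 ms at 200G, per training step. That is why the interconnect starts to matter once the mini-batch is spread over several servers.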

If they want to continue building large GPU-accelerated workloads, pairing more tightly with networking seems like an obvious move.