by ktpsns 2659 days ago
Interesting. Since Mellanox is a big player in the HPC world, this means Nvidia wants to get more serious there. Due to Nvidia's bad Linux support and pricing (compared to AMD), I know quite a number of academic computing centers which like Mellanox hardware but avoid Nvidia hardware like the plague.
6 comments

Modern HPC is being done with the Nvidia toolkits.

Sure, the Nvidia driver is closed-source and a pain to work with for OS developers, but for the use-cases it's designed for (CUDA etc), it's far and away the best-in-class on Linux.

To my knowledge, there are no systems in the TOP500 running AMD chips or GPUs. Intel has some competition in the CPU space (POWER series, some ARM, etc) but if GPUs are in those systems, they're Nvidia.

> Sure, the Nvidia driver is closed-source and a pain to work with for OS developers, but for the use-cases it's designed for (CUDA etc), it's far and away the best-in-class on Linux.

It's unfortunately a sad truth.

CUDA won, and is now the de facto standard for almost every application that runs on a GPU. Nvidia succeeded in locking the entire HPC community into its bloated, badly maintained, crappy software stack, and this is very regrettable.

Any admin or integrator who has had to deal with Nvidia bloatware under Linux hates it, and for very good reasons.

My GTX 1080 works flawlessly with Linux, as has any other NVIDIA graphics card I've ever owned (GTX 680, 480). The only time I tried an AMD card it was a complete dumpster fire, nothing worked (the open source driver at the time sucked and the proprietary driver wouldn't install properly). I bought the AMD card based on the myth that AMD has better linux support...
Ahh what are you talking about? That "myth" didn't exist until the open source drivers really started working.

Biggest issue with AMD on Linux right now is that they sometimes seem to forget to fully enable support in patches before release. Like the RX 590 had to have firmware updates post release because they forgot to do everything I guess.

nVidia GPUs were always recommended over AMD because their support was significantly better before the mainline radeon/radeonsi/amdgpu drivers really started being great. nVidia will still run better now, but the benefit of the open source driver ecosystem outweighs that for me.

I am still waiting for that benefit to provide the missing OpenGL and hardware video acceleration features that were never ported from fglrx.
I think most users' complaint is that their drivers aren't open source and until recently were a pain to install. My 1050 Ti has also worked pretty much flawlessly, but I wish they would open source their drivers and make it easier on the Linux developers.
Trying to get a 1070 set up with 2 monitors on a laptop with hybrid graphics is a nightmare. 1 display driven by Intel, 1 by Nvidia. Cannot get both screens working without 2 X screens. Xinerama wouldn't work with the proprietary drivers. Nouveau has like no support for anything from the 1050 up.

Wanted to try out SwayWM, but they don't work around how nvidia handles things in comparison to what everyone else does.

Works in Ubuntu, but could not for the life of me get 2 monitors working in Arch.

Maybe just maybe this is not the fault of Nvidia but due to the fact that large parts of the Linux ecosystem are fragile, time consuming to configure and break if you look at them in the wrong way. Professional Linux distributions like Ubuntu paper over a lot of that fragility, whereas in Arch Linux you can easily burn days getting basic functionality to work (multiple sound cards come to mind) only for it to break with the next update. And yes, I speak from experience: I used Arch Linux for ~5 years, basically for the fun of doing everything myself, because Ubuntu felt too restrictive and opaque.
> Maybe just maybe this is not the fault of Nvidia but due to the fact that large parts of the Linux ecosystem are fragile, time consuming to configure and break if you look at them in the wrong way.

How it works with open source drivers is that you mainline your drivers so that the kernel maintainers maintain the drivers for you, for free. Choosing to keep your drivers closed source means committing to keeping your drivers up to date with changes in the kernel, or writing an open-source shim that does that. Which approach is more "fragile"?

I'm sorry, but this is not how it works. No one maintains your drivers for free. The contributors to drivers in the Linux kernel are usually employed by the companies that make the product. In addition, they might have to deal with subsystem maintainers who treat their part of the code base as their personal fiefdom. Just look at the hoops that the AMD developers had to jump through to get their driver components accepted into the kernel. Basically they were told they were doing it all wrong and should really be using abstractions already in place, which were probably developed by Intel in order to get their graphics stack to work (https://lists.freedesktop.org/archives/dri-devel/2016-Decemb...). Imagine being treated like that while you also need to support two much larger platforms (consoles and Windows), and ideally a simulation backend for hardware development. Nvidia wisely decided to develop one driver stack to cover all of these platforms, while as far as I know AMD split their effort for a long time. Nvidia also had by far the highest quality OpenGL implementation in place (not sure about Vulkan).

Graphics card drivers happen to be very complex beasts. Somewhere along the stack they need a compiler for multiple custom and often proprietary architectures, which most definitely no one will maintain for free for you. There is a huge incentive to keep most of the code platform-independent and only maintain a minimal kernel-specific component. That component is a comparatively trivial part of the code base; essentially, the kernel should get out of the way as much as possible. Kernel-specific abstractions like KMS are examples of such comparatively trivial things.

> How it works with open source drivers is that you mainline your drivers so that the kernel maintainers maintain the drivers for you, for free.

Not sure what you mean by "maintain for free". Full-coverage testing is not free.

Try doing a PCI passthrough to virtualize that 1080 onto a VM. Then try the same thing with AMD. Sure, the AMD drivers are relatively terrible, but at least they are open source and don't kick you in the knees when you are doing something you ought to be able to do.
Well this is not a supported feature even on Windows (AFAIK), so I'm not sure why this is an issue for you? See here for a list of supported graphics cards https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/ind.... Notice how there is no distinction between Windows and Linux. The 1080 is supported perfectly by nvidia-docker though.
There's no reason the hardware can't do it; the driver actively checks whether you're running in a VM and refuses to talk to the hardware if so. Google "Nvidia error code 43" if you're curious. And some people like to virtualize a video game on their Linux dev PC.
This is not a purchase for gaming -- this is for the HPC market. Nvidia drivers on Linux for HPC work really well. Academia is a tiny, tiny fraction of the market.
nVidia's Tesla cards work relatively painlessly in HPC environments. When you install a supporting driver and set the cards' persistence mode to your needs, the rest is generally automagic.

However, they're hot and need serious juice to run, so you cannot just shove 36 of them into a rack and power them on.

I had the opposite experience.
Maybe they will finally open-source their GPU and CUDA drivers and remove all the headaches from my life.
I too wish dreams came true...