by ulrikhansen54 1104 days ago
More powerful chips are great, but NVIDIA really ought to focus some of their best folks on ironing out the quirks of their CUDA software and making it simpler to actually get stuff running on their hardware. Anyone who's ever fiddled with various CUDA device drivers and tried to line up PyTorch & Python versions will understand the pain.
3 comments

The solution is to not install CUDA on your base system, because you need multiple versions of CUDA and some of them are often incompatible with your distro-provided GCC.

Here is what works for me:

- Nvidia drivers on the base Linux system (rpmfusion/Fedora in my case)

- Install nvidia container toolkit

- Use a cuda base container image and run all your code inside podman or docker
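A minimal sketch of that last step, assuming Docker with the NVIDIA Container Toolkit configured, or Podman with a CDI spec generated by the toolkit. The image tag and the helper name `gpu_run_command` are illustrative assumptions, not something from this thread; check the tag against the registry before relying on it. The helper only builds the command so you can inspect it before running anything.

```python
import shlex

# Assumed image tag for illustration; pick whichever CUDA base image you need.
DEFAULT_IMAGE = "nvidia/cuda:12.0.1-base-ubuntu22.04"


def gpu_run_command(engine="docker", image=DEFAULT_IMAGE, command=("nvidia-smi",)):
    """Build (but do not execute) a container command that exposes all GPUs."""
    if engine == "docker":
        # Docker exposes GPUs through the --gpus flag.
        gpu_args = ["--gpus", "all"]
    elif engine == "podman":
        # Podman relies on a CDI spec generated by the NVIDIA Container Toolkit.
        gpu_args = ["--device", "nvidia.com/gpu=all"]
    else:
        raise ValueError(f"unsupported engine: {engine}")
    return [engine, "run", "--rm", *gpu_args, image, *command]


if __name__ == "__main__":
    # Print the commands so they can be checked before being run by hand.
    for engine in ("docker", "podman"):
        print(shlex.join(gpu_run_command(engine)))
```

If `nvidia-smi` prints the GPU table from inside the container, the driver and toolkit are wired up and everything else (CUDA, PyTorch, etc.) can live in the image.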

I admit it's been a while (2 years) since I last played with Nvidia/CUDA (on Jetson), and back then running CUDA inside Docker was still somewhat arcane. In my experience, whatever the Nvidia documentation lays out works well until you want to 1) cut down on container image size (important for caches and build pipelines) and, to this end, understand what individual deb packages and libraries do, or 2) run the container on a system different from the official Nvidia Ubuntu image.

Back then the docs were just awful. Has this really changed that much in recent times?

Containers have always come in different flavors that represent their sizes and capabilities. For example, runtime containers have the bare minimum to get the application running but none of the debug tools.
The docs are still terrible. Coupled with the AWS / GCP docs around these things, they make it nearly impossible to get this stuff to work without investing a significant amount of time.
PyTorch is the most painless one because everything is bundled in the wheel. The latest stable CUDA supported by PyTorch is 11.8, and I have been running it on a CUDA 12.0 machine because newer NVIDIA drivers are backward compatible with older CUDA runtimes. TensorFlow, on the other hand, requires compilation with the installed CUDA library, and it's truly a pain since I can't change the machine's CUDA version.
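The compatibility rule being relied on here can be sketched as a simple version comparison: a bundled CUDA runtime should work when the driver's supported CUDA version is at least as new. This is a simplification under stated assumptions; it ignores NVIDIA's minor-version compatibility and forward-compatibility packages, and the function names are hypothetical.

```python
def parse_version(version):
    """Turn a dotted version string like '11.8' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def driver_supports_runtime(runtime_cuda, driver_cuda):
    """Simplified rule: the driver's CUDA version must be >= the runtime's."""
    return parse_version(runtime_cuda) <= parse_version(driver_cuda)


if __name__ == "__main__":
    # A wheel bundling CUDA 11.8 on a machine whose driver reports CUDA 12.0.
    print(driver_supports_runtime("11.8", "12.0"))
```

The driver-side version is the "CUDA Version" that `nvidia-smi` reports; the runtime-side version is whatever the wheel bundles.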
Hardware before software!
ATI/AMD GPUs supposedly have great hardware, hamstrung by less-than-great software. In fact it's the lack of some software features making me hesitate to switch despite major cost savings.
AMD drivers are fine if you only care about gaming. There's the occasional idiocy like the default fan curve for my graphics card refusing to run higher than 70% so that the card will cook itself if you actually use it and hard crash your system or the driver, but eh.

The real problem is that ROCm is a fucking joke: a pathetic, half-assed, pretend project. Nobody with power at AMD seems to care that nobody can learn machine learning on their hardware and then push it in other places, or that the higher VRAM they have recently spent all this time boasting about is literally useless unless you want to play poorly optimized AAA titles ported from the PS5.

People say it works, but you basically have to be one of the engineers who wrote it to prove that. Good luck getting it to work with Windows, or with any hardware that wasn't purpose-built for a cluster partner. It's so stupid. Maybe they genuinely intended to make a real CUDA competitor, but noticed that Nvidia then had to artificially segment its market through dumb decisions (the VRAM) and BIOS hacks that didn't work, and just gave up on that path.

In fact, the AMD "fine wine" effect is just them fixing their drivers from launch.