| HN Mirror



	by ulrikhansen54 1151 days ago
	More powerful chips are great, but NVIDIA really ought to focus some of their best folks on ironing out some of the quirks of using their CUDA software and actually getting stuff to run on their hardware in a simpler manner. Anyone who's ever fiddled with various CUDA device drivers and tried to line up PyTorch and Python versions will understand the pain.

3 comments

IceWreck 1151 days ago

The solution is to not install CUDA on your base system, because you need multiple versions of CUDA and some of them are often incompatible with your distro-provided GCC.

Here is what works for me:

- NVIDIA drivers on the base Linux system (RPM Fusion on Fedora in my case)

- Install the NVIDIA Container Toolkit

- Use a CUDA base container image and run all your code inside podman or Docker
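
For anyone who wants to script the last step, here is a minimal sketch in Python that only builds the command line for a quick GPU smoke test. The helper name `gpu_container_command` and the image tag are assumptions for illustration, not something from the setup above; the podman branch also assumes a CDI spec has already been generated by the container toolkit.

```python
import shlex

# Assumption: an example tag only; substitute the CUDA base image you actually need.
DEFAULT_IMAGE = "nvidia/cuda:12.0.0-base-ubuntu22.04"


def gpu_container_command(engine="podman", image=DEFAULT_IMAGE, command=("nvidia-smi",)):
    """Build (but do not run) a container command that exposes all GPUs.

    Hypothetical helper: it only assembles an argument list.
    """
    if engine == "docker":
        # Docker exposes GPUs through the --gpus flag once the toolkit is installed.
        gpu_args = ["--gpus", "all"]
    elif engine == "podman":
        # Podman addresses GPUs as CDI devices (assumes the CDI spec exists).
        gpu_args = ["--device", "nvidia.com/gpu=all"]
    else:
        raise ValueError(f"unsupported container engine: {engine!r}")
    return [engine, "run", "--rm", *gpu_args, image, *command]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(gpu_container_command("podman")))
```

If `nvidia-smi` lists your card from inside the container, the driver and toolkit are wired up correctly. On SELinux systems such as Fedora, podman may need additional security options, so treat this as a starting point rather than a complete recipe.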


codethief 1151 days ago

I admit it's been a while (2 years) since I last played with Nvidia/CUDA (on Jetson) and back then running CUDA inside Docker was still somewhat arcane, but in my experience, whatever the Nvidia documentation lays out works well until you want to 1) cut down on container image size (important for caches and build pipelines) and, to this end, understand what individual deb packages and libraries do, 2) run the container on a system different from the official Nvidia Ubuntu image.

Back then the docs were just awful. Has this really changed that much in recent times?


shaklee3 1150 days ago

Containers have always come in different flavors that represent their sizes and capabilities. For example, runtime containers have the bare minimum to get the application running but none of the debug tools.


ulrikhansen54 1149 days ago

The docs are still terrible. Coupled with the AWS / GCP docs around these things, it is nearly impossible to get this stuff to work without investing a significant amount of time.


thangngoc89 1151 days ago

PyTorch is the most painless one because everything is bundled in the wheel. The latest stable CUDA supported by PyTorch is 11.8, and I have been running it on a CUDA 12.0 machine because the CUDA driver is backward compatible. TensorFlow, on the other hand, requires compilation against the installed CUDA library, and it’s truly a pain since I can’t change the machine’s CUDA version.
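
The compatibility rule here can be sketched in a few lines of Python. This is a simplification under stated assumptions: the helper names `parse_version` and `driver_can_run` are hypothetical, and the check only compares major.minor numbers, ignoring minimum driver versions and other fine print. The live check at the bottom uses `torch.version.cuda` and `torch.cuda.is_available()` and is skipped if PyTorch is not installed.

```python
def parse_version(text):
    """Turn a version string such as '11.8' into a comparable (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def driver_can_run(bundled_cuda, driver_cuda):
    """Simplified rule: a driver supporting CUDA X can run code built for CUDA <= X."""
    return parse_version(driver_cuda) >= parse_version(bundled_cuda)


if __name__ == "__main__":
    # The case from the comment: an 11.8 wheel on a machine whose driver supports 12.0.
    print(driver_can_run("11.8", "12.0"))

    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; skipping the live check")
    else:
        # CUDA version bundled in the installed wheel (None for CPU-only builds).
        print("CUDA bundled with this PyTorch build:", torch.version.cuda)
        print("GPU usable:", torch.cuda.is_available())
```

The reverse case, a wheel built for a newer CUDA than the driver supports, is the one that typically fails, which is why the comparison only goes one way.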


omgJustTest 1151 days ago

Hardware before software!


paulryanrogers 1151 days ago

ATI/AMD GPUs supposedly have great hardware, hamstrung by less-than-great software. In fact it's the lack of some software features making me hesitate to switch despite major cost savings.


mrguyorama 1151 days ago

AMD drivers are fine if you only care about gaming. There's the occasional idiocy like the default fan curve for my graphics card refusing to run higher than 70% so that the card will cook itself if you actually use it and hard crash your system or the driver, but eh.

The real problem is that ROCm is a fucking joke, a pathetic, half-assed, pretend project. Nobody with power in AMD seems to care that nobody can learn machine learning on their hardware to push it in other places, or that the higher VRAM they have recently spent all this time boasting about is literally useless unless you want to play poorly optimized AAA titles ported from the PS5.

People say it works, but you basically have to be one of the engineers who wrote it to prove that. Good luck getting it to work with Windows, or with any hardware that wasn't purpose-built for a cluster partner. It's so stupid. Maybe they genuinely intended to make a real CUDA competitor but noticed the ways that NVIDIA then had to artificially segment their market through dumb decisions (the VRAM) and BIOS hacks that didn't work, and just gave up on that path.


kapperchino 1151 days ago

In fact, the AMD "fine wine" effect is just them fixing their launch drivers.
