| HN Mirror



	by visarga 207 days ago
	Wondering why a $4T company can't afford a smart installation assistant that can auto-detect problems and apply fixes as needed. I wasted too many days chasing driver and torch versions. It's probably the worst part of working in ML. Combine this with Python's horrible package management and you got a perfect combo - like the cough and the stitch.

3 comments

ux266478 206 days ago

I'm wondering how a $4T company got away with shipping the absolute state of the toolchain to begin with. They have total and complete sovereignty over everything on the outside of the OS and PCIe boundaries, with a bottomless pool of top-class labor. There's no reason it has to be cruftier or more fragile than any other low-latency networked computation... and yet here we are. AMD isn't any better. I'm almost interested to see if Intel has done any better with L0, but I highly suspect it suffers from the exact same ecosystem hell problems that plague the other two.

The idea that getting a PCIe FPGA board to crunch numbers is less headache-prone than a GPU is laughable, but that's the absurd reality we live in.


numbers_guy 207 days ago

They provide containers to cater to those needs: https://catalog.ngc.nvidia.com/search


threeducks 207 days ago

After being once again frustrated by the CUDA installation experience, I thought that I should give those containers a try. Unfortunately, my computer did not boot anymore after following the installation instructions for the NVIDIA container toolkit as outlined on the NVIDIA website. Reinstalling everything and following the instructions from some random blog post made it work, but I then found that the container with the CUDA version that I needed had been deprecated.

There were other problems, such as the research cluster of my university not having Docker, but that is a different issue.


YetAnotherNick 207 days ago

Containers don't include drivers, which are the primary source of issues.


torginus 207 days ago

Containers, AFAIR, rely on the driver version matching exactly between the host system and the container itself.

We were on AWS when we used this so setting up seemed easy enough - AWS gave you the driver, and a matching docker image was easy enough to find.


kcb 206 days ago

That's not the case. The CUDA user space in a container does not have to match the host driver's CUDA capability. The container needs to be the same major version or lower. So a system with a CUDA 13 capable driver should be able to run all previous versions.

For some versions there are even compat layers built into the container to allow forward version compatibility.
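
A minimal sketch of the rule of thumb described above, assuming versions are given as plain "MAJOR.MINOR" strings. The function names `parse_cuda_version` and `container_can_run` are hypothetical and not part of any NVIDIA tool. The check only compares major versions as stated in the comment; it does not model the compat layers or any other detail of the real compatibility matrix.

```python
def parse_cuda_version(version: str) -> tuple[int, int]:
    """Parse a 'MAJOR.MINOR' string such as '12.4'; a missing minor defaults to 0."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def container_can_run(driver_cuda: str, container_cuda: str) -> bool:
    """Rule of thumb from the comment: the container's CUDA major version
    must be the same as, or lower than, the major version the host driver supports.
    Assumption: forward-compat layers are ignored here."""
    driver_major, _ = parse_cuda_version(driver_cuda)
    container_major, _ = parse_cuda_version(container_cuda)
    return container_major <= driver_major


if __name__ == "__main__":
    # Hypothetical host whose driver reports CUDA 13.0 capability.
    for container in ("11.8", "12.4", "13.0", "14.0"):
        print(container, container_can_run("13.0", container))
```

Under this simplified rule, a host with a CUDA 13 capable driver accepts the 11.x, 12.x, and 13.x containers and rejects a hypothetical 14.x one.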


fragmede 206 days ago

Just have Claude Code fix it.
