|
Upstream Hydra doesn't build packages with CUDA because it uses a non-FLOSS license. So they are not in the binary cache. You'll end up rebuilding every CUDA-using package every time a transitive dependency is changed. Yeah, I know, pin the world. But you'll still have to build these packages on every machine. So, you have to run your own binary cache. As you see, the rabbit hole gets deep pretty quickly. The only recourse is using the -bin flavors of PyTorch, etc. which will just download the precompiled upstream versions. Sadly, the result will still be much slower than other distributions. First because Python isn't compiled with optimizations and LTO in nixpkgs by default, because it is not reproducible. So, you override the Python derivation to enable optimizations and LTO. Python builds fine, but to get the machine learning ecosystem on you machine, Nix needs to build a gazillion Python packages, since the derivation hash of Python changed. Turns out that many derivations don't actually build. They build with the little amount of parallelism available on Hydra builders, but many Python packages will fail to build because of concurrency issues in tests that do manifest on your nice 16 core machine. So, you spend hours fixing derivations so that they build on many core machines and upstream all the diffs. Or YOLO and you disable unit tests altogether. A few hours/days later (depending on your knowledge of Nix), you finally have a built of all packages that you want, you launch whatever you are doing on your CUDA-capable GPU. Turns out that it is 30-50% slower. Finding out why is another multi-day expedition in profiling and tinkering. In the end pyenv (or a Docker container) on a boring distribution doesn't look so bad. (Disclaimer: I initially added the PyTorch/libtorch bin packages to nixpkgs and was co-maintainer of the PyTorch derivation for a while.) |
I was thinking if it is possible in nixpkgs to create a branch that attempts to create a version match to specific distributions, especially Ubuntu as the ML world is most using it. My idea is to somehow use the deb package information to “shadow” another distribution.
> First because Python isn't compiled with optimizations and LTO in nixpkgs by default, because it is not reproducible. So, you override the Python derivation to enable optimizations and LTO. Python builds fine, but to get the machine learning ecosystem on you machine, Nix needs to build a gazillion Python packages, since the derivation hash of Python changed. Turns out that many derivations don't actually build. They build with the little amount of parallelism available on Hydra builders, but many Python packages will fail to build because of concurrency issues in tests that do manifest on your nice 16 core machine.
I understand your comments including above and the one about CUDA binaries. Just one clarification on the concurrency in tests failure, do you mean it overloads the machine running multi process tests that then tests fail due to assumptions by the package authors?
My main point is that Nix as a system is so incredibly powerful that perhaps there is an ability to “shadow” boring distributions, especially debian based, in some automated way. The we would have the best of both, baseline stability from the distribution and extensibility of nix.