Condor and the like are for independent jobs ("throughput computing"), but the authors here are using MPI for tightly coupled jobs. SLURM and Flux are actively developed schedulers for these kinds of jobs.
SLURM hits a nice sweet spot when you have a very traditional cluster: very homogeneous nodes (both hardware and software), standard logins (e.g. some kind of LDAP/AD), shared NFS files, trusted code. It's an absolute pain when you have:
- Lots of different kinds of nodes
- anything more complex dependency wise than a handful of shared Conda envs
- anything involving docker
- anything vaguely untrusted
- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability
- anything more complex than 3-5 priority levels of scheduling
It's great if you hit that niche, but it frankly struggles with the complexities of even moderately heterogeneous workloads.
It also just feels a bit dated. Even though kube is complex, it's a joy to work with compared to SLURM. HashiCorp is even better imho.
well, that's not a problem of slurm (which will happily start your process on all nodes), but of typical MPI programming. And once you are running something computationally intensive over multiple nodes today, you are still using MPI.
>- anything more complex dependency wise than a handful of shared Conda envs
you can put whatever dependencies you want on your NFS (or copy them to your node). If you're running on a single node it behaves 100% like running with a special login shell on os XYZ, so I don't know what problems happen with dependencies. The main problem would be that it doesn't include any "service discovery" beyond OpenMPI.
>- anything involving docker
have not used it, but there's enroot/singularity. The first of which is apparently dogfooded at Nvidia. It probably needs some adjustments to base images (because of MPI)... As I don't know about the policy within these 5k+ cloud companies: can employees just execute any random image from dockerhub there? This seems a little dangerous...
> anything vaguely untrusted
linked to the docker case? Does kubernetes reboot nodes then? Slurm can do this. And while classical Slurm use cases definitely require a shared account (because of the shared fs), slurm should afaik merrily execute your programs even without any shared account other than slurm's own. You can attack this obviously, but you can attack kubernetes too, and while it gets more scrutiny it's also a byzantine collection of FANG-style requirements.
EDIT: What you can't work around is Slurm needing a comms channel back to the controller, though you could just firewall everything else off (jobs don't use Slurm to communicate...). As each job can execute a Prolog script, you can even quite simply allow traffic to flow only between the allocated nodes.
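To make the Prolog idea a bit more concrete, here is a rough sketch (not something I've run in production) of the logic such a script could use: expand the job's node list and build iptables commands that accept traffic only from the other allocated nodes. Assumptions: the node list is available to the script as a hostlist expression like `node[01-03]` (check which variables your Slurm version exports to the Prolog), the chain name `SLURM_JOB` is made up, traffic to the controller is allowed by some other rule, and the tiny hostlist parser only handles the simple bracket forms. On a real node `scontrol show hostnames` does the expansion properly.

```python
import re


def split_top_level(expr):
    """Split a hostlist on commas that are not inside [...] brackets."""
    parts, depth, cur = [], 0, ""
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(cur)
            cur = ""
        else:
            cur += ch
    if cur:
        parts.append(cur)
    return parts


def expand_hostlist(expr):
    """Expand simple hostlists such as 'node[01-03,07],login1'.

    Only a single bracket group per name is handled; this is a sketch,
    not a full implementation of Slurm's hostlist syntax.
    """
    hosts = []
    for part in split_top_level(expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if not m:
            hosts.append(part)
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            if "-" in item:
                lo, hi = item.split("-")
                width = len(lo)  # keep zero padding, e.g. 01, 02, ...
                for i in range(int(lo), int(hi) + 1):
                    hosts.append(f"{prefix}{i:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


def firewall_rules(nodelist, this_host, chain="SLURM_JOB"):
    """Build iptables commands: accept peers of this job, drop the rest.

    'chain' is a hypothetical, pre-created chain name.
    """
    peers = [h for h in expand_hostlist(nodelist) if h != this_host]
    rules = [["iptables", "-A", chain, "-s", p, "-j", "ACCEPT"] for p in peers]
    rules.append(["iptables", "-A", chain, "-j", "DROP"])
    return rules
```

The Prolog would run each returned command (e.g. via `subprocess.run`) and the matching Epilog would flush the chain again.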
>- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability
that's indeed the case
>- anything more complex than 3-5 priority levels of scheduling
what kind of scheduling does kubernetes implement? I guess you could write a plugin for slurm doing that
> It's great if you hit that niche but it frankly struggles with the complexities of even moderately heterogeneous work loads.
except that your points didn't pertain to this (except maybe for the dependencies, if you think about actual service-dependencies), I fully agree
> you can put whatever dependencies you want on your NFS (or copy them to your node).
This is exactly what we do currently. For non-controlled data, this works. However, this gets really thorny when you involve CUI (controlled unclassified information), precisely because of the shared fs mentioned above.
Both SLURM and Kube let you write schedulers but just getting SLURM to talk to the DB was a tough affair, some very poorly documented bugs were at play.
I haven't been on this project in a bit so I don't recall the exact details. And maybe it's a lack of familiarity with SLURM. But I definitely felt hobbled by it. We are probably going to move to something based on HashiCorp stuff.
yes, I guess you are still using NFSv3? We (really tiny vs. everyone else here) settled on that as well, because it requires less integration overall. Though if you're going the all-AD route, there's the auks plugin for running with NFSv4 (not sure how well ticket renewal works over long runs, though). And you can always just sbcast a zip of your tree and completely forgo the NFS (if you store your data somewhere else). Normally you should also be able to write GRES plugins to "share" these resources.
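For the sbcast route, a minimal sketch of what the staging step inside a job could look like. It only builds the command lines; the archive name, the `/tmp` destination and the unzip step are my assumptions for illustration. `sbcast SOURCE DEST` copies a file to every node of the current allocation, and the `srun --ntasks-per-node=1` call then unpacks it once per node.

```python
import shlex


def staging_commands(archive, dest_dir="/tmp"):
    """Commands to broadcast an archive to all allocated nodes and unpack it.

    Meant to be run from inside a job allocation (e.g. an sbatch script).
    'archive' and 'dest_dir' are illustrative placeholders.
    """
    name = archive.rsplit("/", 1)[-1]
    remote = f"{dest_dir}/{name}"
    return [
        # copy the archive to node-local storage on every allocated node
        ["sbcast", archive, remote],
        # unpack it once per node, overwriting quietly
        ["srun", "--ntasks-per-node=1", "unzip", "-q", "-o", remote, "-d", dest_dir],
    ]


def as_script(commands):
    """Render the commands as shell lines, e.g. to paste into a batch script."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

Printing `as_script(staging_commands("tree.zip"))` gives two lines you can drop at the top of a batch script before the actual workload starts.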
The problem with slurm is how it's typically used: you ssh into a shared login node with a shared file system, authorization is tightly coupled to Linux users on that node, and you submit jobs with sbatch. A Kubernetes deployment feels much more modern and safe.
I have worked with containers + slurm, where the vendor libmpi is injected in the container runtime [1] by a hook, which gives you close to bare metal performance with some container goodness in terms of isolation and deployment.
Slurm should be the answer but it isn't. In our ML environment, it required ML researchers to understand what is going on (more systems knowledge) and no one liked it. The situation devolved to sshing into machines and running jobs. You are right that slurm is a good fit for HPC ... I just don't think DL workloads are exactly that.
One FAANGUAMLetc engineer told me they SSH, Slurm, and track experiments by telling their manager which parameters were best the day before. This was very strange given that this company has a machine learning platform, so either this engineer did not use it, or they did not use it that much.
We were talking about our machine learning platform and taking it for a spin. We do have long-running notebook scheduling[0] but we wanted to be able to watch the notebook's output from multiple devices as it was running, and for it to survive closed tabs or network disconnections, not just get the results once it's done. We also wanted to be able to do that right from the notebook's interface, instead of SSH'ing and all that, as this was tedious and some of our users aren't that comfortable doing that.