
by fock 1970 days ago

hmm, I'd like to digress

>- Lots of different kinds of nodes

well, that's not a problem of slurm (which will happily start your process on all nodes), but of typical MPI programming. And once you are running something computationally intensive over multiple nodes today, you are still using MPI.

>- anything more complex dependency wise than a handful of shared Conda envs

you can put whatever dependencies you want on your NFS (or copy them to your node). If you're running on a single node it behaves 100% like running with a special login shell on os XYZ, so I don't know what problems happen with dependencies. The main problem would be that it doesn't include any "service discovery" beyond OpenMPI.

>- anything involving docker

have not used it, but there's enroot/singularity, the first of which is apparently dogfooded at Nvidia. It probably needs some adjustments to base images (because of MPI)... As I don't know about the policy within these 5k+ cloud companies: can employees just execute any random image from dockerhub there? This seems a little dangerous...

> anything vaguely untrusted

linked to the docker case? Does kubernetes reboot nodes then? Slurm can do this. And while classical Slurm use cases definitely require a shared account (because of the shared fs), slurm should afaik merrily execute your programs even without any shared account other than slurm's own. You can attack this obviously, but you can attack kubernetes too, and while it gets more scrutiny, it's also a byzantine collection of FANG-style requirements.

EDIT: What you can't work around is Slurm needing a comms-channel back to the controller, though you could just firewall that off (jobs don't use Slurm to communicate...). As each job can execute a Prolog-script, you can even quite simply allow traffic to flow only between the allocated nodes.
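A rough sketch of the Prolog idea, assuming the script is handed the job's already-expanded host list (for example via `scontrol show hostnames`). The chain name `SLURM_JOB` and the helper `build_rules` are made up for illustration, and the rules are only generated, not applied:

```python
from typing import List


def build_rules(this_host: str, job_hosts: List[str], chain: str = "SLURM_JOB") -> List[str]:
    """Return iptables commands that accept traffic only from the job's other nodes.

    `chain` is a hypothetical, pre-created iptables chain. Nothing is executed
    here, so the commands can be reviewed before being wired into a Prolog.
    """
    rules = [f"iptables -F {chain}"]  # clear rules left over from a previous job
    for host in sorted(set(job_hosts)):
        if host == this_host:
            continue  # no rule needed for traffic from this node itself
        rules.append(f"iptables -A {chain} -s {host} -j ACCEPT")
    rules.append(f"iptables -A {chain} -j DROP")  # everything else is dropped
    return rules


if __name__ == "__main__":
    for line in build_rules("node01", ["node01", "node02", "node03"]):
        print(line)
```

A real setup would still have to leave Slurm's own channel to the controller reachable for the daemon, and an Epilog would flush the chain again once the job ends.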

>- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability

that's indeed the case

>- anything more complex than 3-5 priority levels of scheduling

what kind of scheduling does kubernetes implement? I guess you could write a plugin for slurm doing that

> It's great if you hit that niche but it frankly struggles with the complexities of even moderately heterogeneous work loads.

except that your points didn't pertain to this (except maybe for the dependencies, if you think about actual service-dependencies), I fully agree

1 comments

kortex 1970 days ago

All very good points!

> you can put whatever dependencies you want on your NFS (or copy them to your node).

This is exactly what we do currently. For non-controlled data, this works. However, this gets really thorny when you involve CUI (controlled unclassified information), precisely because of the shared fs you mention.

Both SLURM and Kube let you write schedulers, but just getting SLURM to talk to the DB was a tough affair; some very poorly documented bugs were at play.

I haven't been on this project in a bit so I don't recall the exact details. And maybe it's lack of familiarity with SLURM. But I definitely felt hobbled by it. We are probably going to move to something based off of Hashicorp stuff.


fock 1969 days ago

yes, I guess you are still using NFSv3? We (really tiny vs. everyone else here) settled on that as well, because it requires less integration overall. Though if you're going the all-AD-route, there's the auks-plugin for running with NFSv4 (not sure how long ticket renewal works though). And you can always just sbcast a zip of your tree and completely forgo the NFS (if you store your data somewhere else). Normally you should also be able to write GRES-plugins to "share" these resources.
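A sketch of that sbcast route, under stated assumptions: the destination paths and the helper name `stage_commands` are hypothetical, and the returned commands are meant to be run from inside a job allocation (e.g. at the top of a batch script), since sbcast only works there.

```python
import shutil
from pathlib import Path
from typing import List


def stage_commands(tree: Path, workdir: Path, node_dest: str = "/tmp/deps.zip") -> List[List[str]]:
    """Zip `tree` into `workdir` and return the commands that would ship it to each node.

    `workdir` should live outside `tree`, otherwise the archive ends up inside
    the directory that is being zipped.
    """
    archive = shutil.make_archive(str(workdir / "deps"), "zip", root_dir=str(tree))
    return [
        # copy the archive to local storage on every allocated node
        ["sbcast", "--force", archive, node_dest],
        # unpack it once per node (target directory is an assumption)
        ["srun", "--ntasks-per-node=1", "unzip", "-o", "-q", node_dest, "-d", "/tmp/deps"],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` as is. Keeping the archive step separate from execution makes it easy to inspect what would be broadcast before anything touches the nodes.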
