
	by IshKebab 203 days ago
	I wish there was a sane modern alternative to SLURM. Futile hope though. My company is still using SGE.

2 comments

linksnapzz 202 days ago

You can pay for LSF, which is older than SLURM but, IMHO, more reliable under load.

siliconpotato 202 days ago

Slurm is the modern alternative. We are using SGE too, and Slurm feels like the future.

IshKebab 202 days ago

Yeah, unfortunately it still sucks. Actually, to be fair, it's probably fine for its intended use case: researchers interactively running one-off batch jobs on a university HPC cluster.

But I work in silicon, and every company I've worked at uses SGE/SLURM for automated testing. SLURM absolutely sucks for that. They really want you to submit jobs as bash scripts. It can't handle a large number of jobs without janky array jobs, and submitting a job and waiting for it to finish is kind of janky too. Getting the output anywhere except a file is difficult. Nesting jobs is super awkward and buggy. All the command-line tools feel like they're from the 80s: by default the column widths are like 5 characters (not an exaggeration).
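For readers who haven't used it, the submit-and-wait pattern described above might look roughly like the sketch below. It only builds the `sbatch` argument list and does not run anything. The helper name `sbatch_wait_command` and the file name `test.log` are made up for illustration; `--wait`, `--output` and `--wrap` are standard `sbatch` options. Note that the job's output still lands in a file, which is the complaint above.

```python
import shlex


def sbatch_wait_command(argv, output_path):
    """Build (but do not run) an sbatch command that blocks until the job ends.

    `argv` is the test command to run on the cluster, as a list of strings.
    `output_path` is where Slurm should write the job's stdout.
    """
    return [
        "sbatch",
        "--wait",                      # do not return until the job terminates
        f"--output={output_path}",     # job stdout is written to this file
        f"--wrap={shlex.join(argv)}",  # wrap a plain command instead of a script file
    ]


# Hypothetical usage: a CI step that runs `make test` on the cluster.
cmd = sbatch_wait_command(["make", "test"], "test.log")
print(cmd)
```

A test runner would pass this list to `subprocess.run`, check the return code, and then read `test.log` back from shared storage to get the output.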

We even had an issue that SLURM uses 4 ports per job for the duration of the job, so you can't actually run more than a few thousand jobs simultaneously because the controller runs out of TCP ports!
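A back-of-the-envelope check of that limit, as a sketch only: the figure of 4 ports per job comes from the comment above, while the port ranges are assumptions (the full non-zero TCP port space, and the common Linux default ephemeral range of 32768 to 60999). Which range the controller actually draws from depends on the site's configuration.

```python
def max_concurrent_jobs(port_low, port_high, ports_per_job=4):
    """Upper bound on simultaneous jobs if each job pins `ports_per_job` ports."""
    usable_ports = port_high - port_low + 1
    return usable_ports // ports_per_job


# Assumption: every non-zero TCP port is available to the controller.
print(max_concurrent_jobs(1, 65535))      # 16383

# Assumption: only the default Linux ephemeral range is available.
print(max_concurrent_jobs(32768, 60999))  # 7058
```

Either way the ceiling is in the thousands, which is consistent with "a few thousand jobs".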

I don't think it would actually be that hard to write a modern replacement. The difficult bit is dealing with cgroups. I won't hold my breath for anyone in the silicon industry to write it though. Hardware engineers can't write software for shit.

siliconpotato 201 days ago

> We even had an issue that SLURM uses 4 ports per job for the duration of the job, so you can't actually run more than a few thousand jobs simultaneously because the controller runs out of TCP ports!

That sounds concerning. Do you have a link to a bug report for this, please? Is the TCP port problem on the compute node side or the controller side?

IshKebab 201 days ago

The controller side. I don't think it is a bug; that's just how they designed it.

They want you to use array jobs for large jobs, or submit jobs in a fire-and-forget way.
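To make the two styles concrete, here is a minimal sketch that builds the `sbatch` command lines for each without running them. The function names and the script name `run_test.sh` are hypothetical. `--array` and `--parsable` are standard `sbatch` options, and inside an array job each task can read its index from the `SLURM_ARRAY_TASK_ID` environment variable.

```python
def array_submit_command(script, n_tasks, max_running=None):
    """One submission covering `n_tasks` tasks, indexed 0..n_tasks-1."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    spec = f"0-{n_tasks - 1}"
    if max_running is not None:
        # "%N" limits how many array tasks run at the same time.
        spec += f"%{max_running}"
    # --parsable makes sbatch print just the job ID, which is easier to script.
    return ["sbatch", "--parsable", f"--array={spec}", script]


def fire_and_forget_commands(script, n_tasks):
    """One independent submission per task; the index is passed as an argument."""
    return [["sbatch", "--parsable", script, str(i)] for i in range(n_tasks)]


print(array_submit_command("run_test.sh", 1000, max_running=50))
print(len(fire_and_forget_commands("run_test.sh", 1000)))
```

The array form is a single controller request for all 1000 tasks, while the second form issues 1000 separate submissions and leaves the caller to poll for results later.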
