Yeah unfortunately it still sucks. Actually to be fair it's probably fine for its intended use case: researchers interactively running one-off batch jobs on a university HPC cluster.
But I work in silicon and every company I've worked in uses SGE/SLURM for automated testing. SLURM absolutely sucks for that. They really want you to submit jobs as bash scripts, they can't handle a large number of jobs without using janky array jobs, submitting a job and waiting for it to finish is kind of janky. Getting the output anywhere except a file is difficult. Nesting jobs is super awkward and buggy. All the command line tools feel like they're from the 80s - by default the column widths are like 5 characters (not an exaggeration).
We even had an issue that SLURM uses 4 ports per job for the duration of the job, so you can't actually run more than a few thousand jobs simultaneously because the controller runs out of TCP ports!
I don't think it would actually be that hard to write a modern replacement. The difficult bit is dealing with cgroups. I won't hold my breath for anyone in the silicon industry to write it though. Hardware engineers can't write software for shit.
> We even had an issue that SLURM uses 4 ports per job for the duration of the job, so you can't actually run more than a few thousand jobs simultaneously because the controller runs out of TCP ports!
That sounds concerning. Do you have a link to a bug report for this please? Is the tcp port problem on the compute node side or the controller side?