100%. Slurm is aimed at job scheduling and resource management on HPC clusters, which makes it a pain in the ass for the kind of fast ad hoc iteration and testing that AI/ML requires.
The exception is when you can submit an interactive Slurm job and get an H100 to yourself for a few hours. If the cluster is overloaded, it’s hard to get those jobs to start when you’d like. There are still ways, but you do have to be patient.
Even then, it’s not quite like having exclusive access to resources whenever you want them, so I can see it both ways.
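For anyone who hasn't tried the interactive route, here is a rough sketch of what that request can look like, written as a small Python helper that builds an `srun` command. The helper name `build_interactive_srun` is made up for illustration, and the GRES type `h100` is an assumption: sites name their GPU types differently, so check what your cluster actually exposes. The flags themselves (`--gres`, `--time`, `--partition`, `--pty`) are standard `srun` options.

```python
import shlex


def build_interactive_srun(gpu_type="h100", gpus=1, hours=4, partition=None, shell="bash"):
    """Build the argv for an interactive Slurm session with dedicated GPUs.

    gpu_type is site-specific (assumption: the cluster defines a GRES type
    called "h100"); partition is optional and also depends on the site.
    """
    if gpus < 1 or hours < 1:
        raise ValueError("gpus and hours must both be at least 1")

    cmd = [
        "srun",
        f"--gres=gpu:{gpu_type}:{gpus}",  # request GPUs of the given type
        f"--time={hours:02d}:00:00",      # wall-clock limit as HH:MM:SS
    ]
    if partition:
        cmd.append(f"--partition={partition}")
    cmd += ["--pty", shell]               # attach a terminal and start a shell
    return cmd


cmd = build_interactive_srun()
print(shlex.join(cmd))
```

Run the printed command on a login node. On a busy cluster it will simply sit in the queue until the GPU frees up, which is where the patience comes in.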