by latchkey 240 days ago
Your complaint sounds like it is more about the way you have to access the HPC (via slurm) than about the compute itself. After having now tried slurm myself, I don't understand the love for it at all.

As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?


I’m not complaining. The clusters are great. The non-Slurm H100s are great. The Spark is more fun.
What makes it more fun?
I think that personal computing is more fun than time-shared computing. :)

It's remarkable what can now be done on a whisper-quiet little box. I hope the Strix Halos will be just as much fun, and they should be, so long as Flash Attention works.

Haven't tried to compile it for SH, but did compile it for MI355x and it worked. LONG compile time, though ninja sure helped.

Fair, thanks for the answer.

The bane of my existence...

  salloc: Granted job allocation 1978
  salloc: Waiting for resource configuration
You can't attach a monitor to an H100; it has no video out.

Even ignoring GPU details, the Spark is an awesome, quiet little powerhouse of an arm64 workstation that is 100% Linux-first.

At least on our offerings (MI300x), we offer console and even iDrac bios access (bare metal) and it is all running Ubuntu.
Sure, what I mean is that a physical monitor is more fun than a virtual console.

Curious, though, how you offer iDRAC to customers. Do you have another OOB BMC for the iDRAC? Or is this an internal engineering context?

I really don't understand the difference. Either way, it is just a window into a computer. ¯\_(ツ)_/¯

We rent bare metal on-demand and our whole business is to be able to offer compute that you probably wouldn't be able to host in your house $, as if you own it yourself.

So, we made it so that users can get access into the BMC and modify the box however they want. When they are done, we've automated the reset as well. Fully self-service.

$ These boxes are very expensive, weigh 350lbs, sound like a jet engine and consume ~10kW.

100% - slurm is aimed at job maintenance and resource management on HPC clusters. That makes it a pain in the ass for the kind of fast ad hoc iteration and testing that AI/ML requires.
Unless you can submit an interactive slurm job and get exclusive access to an H100 for a few hours of dedicated time. If the cluster is overloaded, it’s hard to get those to run when you’d like, but there are still ways. But you do have to be patient.

But it’s still not quite like exclusive access to resources when you want them. So I can see it from both ways.
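
For anyone who hasn't tried the interactive route mentioned above, here is a rough sketch of what the request could look like, written as a small Python helper that only builds the `srun` command line and does not run it. The helper name `interactive_gpu_command` is made up for illustration, and the GRES string `gpu:h100:1` is an assumption: GPU type names are site-specific, so check what your cluster actually calls them.

```python
import shlex


def interactive_gpu_command(gres="gpu:h100:1", hours=2, shell="bash"):
    """Build (but do not run) an srun command for an interactive GPU shell.

    The default gres value is an assumption; GPU type names vary per cluster.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    return [
        "srun",
        f"--gres={gres}",             # request one GPU of the named type
        f"--time={hours:02d}:00:00",  # wall-clock limit, HH:MM:SS
        "--pty",                      # attach a pseudo-terminal to the task
        shell,                        # program to run interactively
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for a login node.
    print(shlex.join(interactive_gpu_command()))
```

Whether that starts right away or sits in the queue still depends on how loaded the cluster is, which is the patience part.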

The love for Slurm comes from experience with other, older HPC batch schedulers which were/are objectively worse in so many ways.