| HN Mirror

	by latchkey 240 days ago
	Your complaint sounds more like the way that you have to access the HPC (via slurm), not the compute itself. After having now tried slurm myself, I don't understand the love for it at all. As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?

3 comments

enum 240 days ago

I’m not complaining. The clusters are great. The non-Slurm H100s are great. The Spark is more fun.

latchkey 239 days ago

What makes it more fun?

enum 239 days ago

I think that personal computing is more fun than time-shared computing. :)

It's remarkable what can now be done on a whisper-quiet little box. I hope the Strix Halos will be just as much fun, and they should be, so long as Flash Attention works.

latchkey 239 days ago

Haven't tried to compile it for SH, but I did compile it for MI355X and it worked. LONG compile time, though; ninja sure helped.

Fair, thanks for the answer.

The bane of my existence...

  salloc: Granted job allocation 1978
  salloc: Waiting for resource configuration

moondev 239 days ago

You can't attach a monitor to an H100; it has no video out.

Even ignoring the GPU details, the Spark is an awesome, quiet little powerhouse of an arm64 workstation that is 100% Linux-first.

latchkey 238 days ago

At least on our offerings (MI300X), we offer console and even iDRAC BIOS access (bare metal), and it is all running Ubuntu.

moondev 238 days ago

Sure, what I mean is that a physical monitor is more fun than a virtual console.

Curious, though, how you offer iDRAC to customers: do you have another OOB BMC for the iDRAC? Or is this an internal engineering context?

latchkey 238 days ago

I really don't understand the difference. Either way, it is just a window into a computer. ¯\_(ツ)_/¯

We rent bare metal on-demand and our whole business is to be able to offer compute that you probably wouldn't be able to host in your house $, as if you own it yourself.

So, we made it so that users can get access into the BMC and modify the box however they want. When they are done, we've automated the reset as well. Fully self-service.

$ These boxes are very expensive, weigh 350 lbs, sound like a jet engine, and consume ~10 kW.

yunohn 239 days ago

100% - Slurm is aimed at job maintenance and resource management on HPC clusters, which makes it a pain in the ass for the kind of fast ad hoc iteration and testing that AI/ML requires.

mbreese 239 days ago

Unless you can submit an interactive slurm job and get exclusive access to an H100 for a few hours of dedicated time. If the cluster is overloaded, it’s hard to get those to run when you’d like, but there are still ways. But you do have to be patient.

But it’s still not quite like exclusive access to resources when you want them. So I can see it from both ways.
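
For anyone who hasn't done it, here is a rough sketch of what such an interactive request can look like, written as a small Python helper that only builds the `srun` command line. The flags (`--gres`, `--time`, `--exclusive`, `--pty`) are standard Slurm options, but the helper name `interactive_gpu_command` and the GRES type `h100` are assumptions for illustration; clusters name their GPU resources differently and many also require a partition or account.

```python
import shlex


def interactive_gpu_command(gpus=1, hours=2, gpu_type=None, shell="bash"):
    """Build an srun command line for an interactive GPU session.

    Hypothetical helper: it only assembles arguments and never calls Slurm.
    gpu_type (e.g. "h100") is site-specific and is an assumption here.
    """
    if gpus < 1 or hours < 1:
        raise ValueError("gpus and hours must be positive integers")
    # Generic resource request: "gpu:<count>" or "gpu:<type>:<count>".
    gres = f"gpu:{gpu_type}:{gpus}" if gpu_type else f"gpu:{gpus}"
    return [
        "srun",
        f"--gres={gres}",
        f"--time={hours:02d}:00:00",  # wall-clock limit as HH:MM:SS
        "--exclusive",                # do not share the allocated node
        "--pty",                      # attach a pseudo-terminal
        shell,
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(interactive_gpu_command(gpu_type="h100")))
```

Whether the job starts right away still depends on the queue, which is the patience part.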

pinewurst 239 days ago

The love for Slurm comes from experience with other, older HPC batch schedulers which were/are objectively worse in so many ways.
