
by enum 237 days ago

I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.

The Spark is much more fun. And I’m more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but there's nothing like a personal computer.

1 comments

latchkey 237 days ago

Your complaint sounds more like the way that you have to access the HPC (via slurm), not the compute itself. After having now tried slurm myself, I don't understand the love for it at all.

As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?


enum 237 days ago

I’m not complaining. The clusters are great. The non-Slurm H100s are great. The Spark is more fun.


latchkey 237 days ago

What makes it more fun?


enum 237 days ago

I think that personal computing is more fun than time-shared computing. :)

It's remarkable what can now be done on a whisper-quiet little box. I hope the Strix Halos will be just as much fun, and they should be, so long as Flash Attention works.


latchkey 236 days ago

Haven't tried to compile it for SH, but did compile it for MI355x and it worked. LONG compile time, though ninja sure helped.

Fair, thanks for the answer.

The bane of my existence...

  salloc: Granted job allocation 1978
  salloc: Waiting for resource configuration


moondev 236 days ago

You can't attach a monitor to an H100; it has no video out.

Even ignoring GPU details, the Spark is an awesome, quiet little powerhouse of an arm64 workstation that is 100% Linux-first.


latchkey 236 days ago

At least on our offerings (MI300x), we offer console and even iDRAC BIOS access (bare metal), and it is all running Ubuntu.


moondev 236 days ago

Sure, what I mean is a physical monitor is more fun than a virtual console.

Curious, though, how you offer iDRAC to customers: do you have another OOB BMC for the iDRAC? Or is this an internal engineering context?


yunohn 237 days ago

100% - Slurm is aimed at job maintenance and resource management on HPC clusters, which makes it a pain in the ass for the kind of fast ad hoc iteration and testing that AI/ML requires.


mbreese 236 days ago

Unless you can submit an interactive slurm job and get exclusive access to an H100 for a few hours of dedicated time. If the cluster is overloaded, it’s hard to get those to run when you’d like, but there are still ways. But you do have to be patient.

But it’s still not quite like exclusive access to resources when you want them. So I can see it from both ways.


pinewurst 237 days ago

The love for Slurm comes from experience with other, older HPC batch schedulers which were/are objectively worse in so many ways.
