by RyeCatcher 236 days ago
I absolutely love it. I’ve been up for days playing with it. But there are some bleeding edge issues. I tried to write a balanced article. I would highly recommend it for people that love to get their hands dirty. Blows away any consumer GPU.
5 comments

+1

I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.

The Spark is much more fun. And I’m more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but nothing like a personal computer.

Your complaint sounds more like the way that you have to access the HPC (via slurm), not the compute itself. After having now tried slurm myself, I don't understand the love for it at all.

As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?

I’m not complaining. The clusters are great. The non-Slurm H100s are great. The Spark is more fun.
What makes it more fun?
I think that personal computing is more fun than time-shared computing. :)

It's remarkable what can now be done on a whisper-quiet little box. I hope the Strix Halos will be just as much fun, and they should be, so long as Flash Attention works.

Haven't tried to compile it for SH, but did compile it for MI355x and it worked. LONG compile time, though ninja sure helped.

Fair, thanks for the answer.

The bane of my existence...

  salloc: Granted job allocation 1978
  salloc: Waiting for resource configuration
You can't attach a monitor to an H100, it has no video out.

Even ignoring GPU details, the Spark is an awesome, quiet little powerhouse of an arm64 workstation that is 100% Linux first.

At least on our offerings (MI300x), we offer console and even iDRAC BIOS access (bare metal) and it is all running Ubuntu.
100% - slurm is aimed at job maintenance and resource management on HPC clusters, which makes it a pain in the ass for the kind of fast ad hoc iteration and testing that AI/ML requires.
Unless you can submit an interactive slurm job and get exclusive access to an H100 for a few hours of dedicated time. If the cluster is overloaded, it’s hard to get those to run when you’d like, but there are still ways. But you do have to be patient.

But it’s still not quite like exclusive access to resources when you want them. So I can see it from both ways.
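
For reference, an interactive allocation like the one described above is usually requested with `salloc`. Here is a minimal sketch in Python that only builds the command line. The partition name `gpu` and the GRES type `h100` are hypothetical, since both are site-specific, and nothing here talks to a real cluster.

```python
import shlex


def build_salloc_command(hours, gpus=1, gres_type="h100", partition="gpu"):
    """Build an salloc argv for an exclusive, interactive GPU allocation.

    gres_type and partition are placeholders: real names vary per cluster.
    hours must be a whole number of hours.
    """
    if hours <= 0 or gpus <= 0:
        raise ValueError("hours and gpus must be positive")
    return [
        "salloc",
        f"--partition={partition}",        # hypothetical partition name
        "--nodes=1",
        "--exclusive",                     # do not share the node with other jobs
        f"--gres=gpu:{gres_type}:{gpus}",  # hypothetical GRES type
        f"--time={hours:02d}:00:00",       # wall-clock limit as HH:MM:SS
    ]


command = build_salloc_command(hours=3)
print(shlex.join(command))
```

Once the allocation is granted, running `srun --pty bash` inside it opens a shell on the node. How long the grant takes still depends on how loaded the cluster is.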

The love for Slurm comes from experience with other, older HPC batch schedulers which were/are objectively worse in so many ways.
> Blows away any consumer GPU.

Nah. Do you have first-hand experience with Strix Halo? At less than 1600€ for a 128GB configuration it manages >45 tokens/s with gpt-oss 120b, which is faster than the DGX Spark at a fraction of the cost.

Strix Halo has awful token prefill speed. Only suitable for very small contexts.
One thing I can’t find anyone mention in reviews - does inference screech to a halt when using large context windows on models? Say if you’re in the 100k range on gpt-oss. I’m not concerned about lightning inference speed overall as I understand the purpose of the spark is to be well rounded / trainer tuner. I just want to know if it becomes unusable vs reasonable slowdown at larger contexts. That’s the thing people are unpleasantly surprised to find about a Mac Studio which has prevented me from going that route.
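
A rough way to reason about the question above: time to first token is approximately prompt length divided by prefill throughput, and the wait after that is output length divided by decode throughput. The sketch below does that arithmetic. The throughput figures are made-up placeholders, not measurements of the Spark, Strix Halo, or a Mac Studio, and the model assumes constant throughput even though prefill typically slows down as context grows.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token_s, total_s) under a constant-throughput model."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughputs must be positive")
    ttft = prompt_tokens / prefill_tps          # time spent processing the prompt
    total = ttft + output_tokens / decode_tps   # plus time spent generating output
    return ttft, total


# Hypothetical throughputs (tokens/s); substitute your own benchmark numbers.
for label, prefill_tps in [("slow prefill", 200.0), ("fast prefill", 2000.0)]:
    ttft, total = estimate_latency(100_000, 500, prefill_tps, decode_tps=40.0)
    print(f"{label}: first token after {ttft:.0f} s, done after {total:.1f} s")
```

At the same decode speed, a 10x difference in prefill rate is the difference between waiting several minutes and waiting under a minute for the first token, which is why decode tokens/s alone says little about usability at 100k context.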
Thanks for this bleeding edge content!

But please have your LLM post writer be less verbose and repetitive. This is like the stock output from any LLM, where it describes in detail and then summarizes back and forth over multiple useless sections. Please consider a smarter prompt and post-editing…

I agree whole-heartedly. Two thirds of the article read like slop.
Since the text is obviously LLM output, how much prompting and editing went into this post? Did you have to correct anything that you put into it that it then got wrong or added incorrect output to?
Definitely reeks of someone who doesn't know what makes a readable blogpost and hoped the LLM did.

I was not familiar with the hardware, so I was disappointed there wasn't a picture of the device. Tried to skim the article and it's a mess. Inconsistent formatting and emoji without a single graph to visualize benchmarks.

I read the whole thing now and it's filled with slop. I don't really care about the emojis and the marketing voice too much. I do care that it's impossible to tell what the author cared about and what they didn't, or if any of it is made up or extrapolated.

I bet the input to the LLM would have been more interesting.

> Training Performance is Real (When It Works)

It looks like it worked? Why's it say this?

> Verdict: Inference speed scales proportionally with model size.

Author only tried one model size and it's faster than NVIDIA's reported speed at a larger model. Not really a "Verdict".

> Verdict: 4-bit quantization is production-viable.

That's not really something you can conclude from messing around with it and saying you like the outputs.

> GPU Inference is Fundamentally Broken

Probably not? It probably just doesn't work in llama.cpp right now? It takes a while reading this to work out that they tried ollama and then later llama.cpp, which I'd guess is basically testing llama.cpp twice. Actually I don't even believe that. I'm sure the author ran into errors that might be a pain to figure out, but there's no evidence it's worse than that.

But then it says this is the "root cause":

    ARM64 + Blackwell + CUDA 13.0 = Bleeding Edge
    ↓
    Limited production testing
    ↓
    Edge cases in numerical precision (inference)
    ↓
    Memory management issues (training)
Am I to believe GPU inference is really fundamentally broken? I'm not seeing the case made here, just claims. At this point the LLM seems to have gotten confused about whether it's talking about the memory fragmentation issue or the GPU inference issue. But it's hard to believe anything from this point on in the post.