Hacker News new | ask | show | jobs
by donmcronald 58 days ago
I have an Mini M4 Pro with 64GB of 273GB/s memory bandwidth and it's borderline with 3.5-27B. I assume this one is the same. I don't know a ton, but I think it's the memory bandwidth that limits it. It's similar on a DGX Spark I have access to (almost the same memory bandwidth).

It's been a while since I tried it, but I think I was getting around 12-15 tokens per second an that feels slow when you're used to the big commercial models. Whenever I actually want to do stuff with the open source models, I always find myself falling back to OpenRouter.

I tried Intel/Qwen3.6-35B-A3B-int4-AutoRound on a DGX Spark a couple days ago and that felt usable speed wise. I don't know about quality, but that's like running a 3B parameter model. 27B is a lot slower.

I'm not sure if I "get" the local AI stuff everyone is selling. I love the idea of it, but what's the point of 128GB of shared memory on a DGX Spark if I can only run a 20-30GB model before the slow speed makes it unusable?

2 comments

There are a number of DGX benchmarks for these recent gemma-4 / qwen-3.6 models on the nvidia forum, ex: https://forums.developer.nvidia.com/t/qwen-qwen3-6-35b-a3b-a...
Tbf the Sparks usefulness isn’t for inference IMO. Its memory bandwidth is too low for that.

But on the other hand, running Qwen 3.5 122B A10B locally on it using ~110GB of memory and getting 50tk/s generation and quite excellent prefill… I couldn’t do that on many other machines at this price point

For me this has been awesome to learn CUDA on, fine tuning models (until I get it close to what I want then it’s off to H100 or something clusters) and a bit of inference on the side