Hacker News
by dirtikiti 66 days ago
"Local AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. The software to run those chips well doesn't."

So figure out how to run it on Vulkan instead of requiring the user to be locked into expensive CUDA cards.

3 comments

So everyone is aware, you can already run Qwen3.5-27B on Vulkan or Apple's hardware. Every major inference engine supports it right now.

This repo is a vibecoded demo implementation of some recent research papers combined with some optimizations that sacrifice quality for speed to get a big number that looks impressive. The 207 tok/s number they're claiming only appears in the headline. The results they show are half that or less, so I already don't trust anything they're saying they accomplished.

If you want to run Qwen3.5-27B, you can do it with a project like llama.cpp on CUDA, Vulkan, Apple, or even CPU.

This. Even on Android via Termux you can run Ollama with GPU acceleration on a phone. It works, though mileage will vary.
Yes, you can run Qwen on Vulkan or CPU. But you aren't getting 207 tok/s.

I just find it funny they talk about being vendor-locked, and the only thing they support is Nvidia.

You can run pretty much every model on Vulkan, including the Qwen MoE models. You can also run pretty much every model on ROCm, Apple Silicon via MLX, and Intel hardware via OpenVINO. Nvidia got there first, but they're no longer clearly dominant in the self-hosting space, simply because of the high cost. I think Apple probably has the lead there, due to unified memory allowing big models to run without multiple big dedicated GPUs, but stuff like Strix Halo with 128GB of unified memory is also pretty much sold out everywhere. There's a lower bound on how small a model can be and still be useful.

Anyway, I don't have any Nvidia hardware, and I've got several local models running and/or training at all times.

Yes, but they're claiming massive generation speed which you won't get on Vulkan. You won't get it on ROCm on that Strix Halo, either.

It's just funny they talk about vendor lock-in, and they only support Nvidia.

Why doesn’t Apple?
Like with all new tech trends, it takes them a hot minute to catch up, but it's highly likely they will (eventually) release some killer platforms for local AI. The shared memory, high bandwidth, and power efficiency of their M chips make for a near-ideal architecture. If/when they finally push out the M5 Ultra, that could be round one (albeit still not at the best price/performance vs. comparable cloud API tokens). A real mass-market killer device for local LLMs is still going to require some remediation of the global DRAM shortages, and maybe the M6/M7 generation.
Apple has Metal, which is already pretty well integrated in llama.cpp, various Python libs, and mistral-rs & candle. Unpopular opinion, but Vulkan is hot garbage and the definition of "design by committee." There's a reason people still prefer CUDA, even though most code could likely be ported programmatically anyway.
Vulkan is not Apple.

Metal is Apple's API.

After the steep increase in sales of Mac Studios specifically for LLMs, I'm waiting for Apple to release a frontier-level model optimized for the highest end of Apple hardware. It would probably be hardware-locked to a particular neural processor, which would in turn lock the memory config.

The built-in Apple Intelligence model right now is very small, but even just having a small LLM you know is always there, online, fast, and ready makes you think about building apps differently. I would love the context to expand from the meager ~4K tokens.