Hacker News new | ask | show | jobs
by aftbit 1 day ago
IMO running local models "well" still requires an expensive hardware investment. You really want 96GB of VRAM on a modern Blackwell arch to run these models with decent KV cache. Trying to run them on a unified memory Mac, an AI Max AMD processor, or a DGX Spark-alike is really just asking for trouble. Prefill kills perf.

If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.

Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.

10 comments

For a fraction of the price of 96GB vram, I built a desktop based on a supermicro server mobo and EPYC 9 series CPU, with just under 400GB rdimm ram (approx $4500 all in but this was before the ram price hike). Works really well for serving larger local modals at a decent enough speed (I consider anything more than 10 tokens/second usable and value accuracy over speed).
that costs the same as 210 months (17.5) years of using codex 5.5 while not being near as good.
Yes, if your trust model allows you to use API providers or the big 3, you 100% should. They have better util than anything you self host, so they can be more efficient. On top of that, they're shoveling cash into the fire to try to capture marketshare, so they're offering inference for well below break-even costs.

The main reasons to use local models are:

1. Self-sovereignty & control

2. Data security

3. Offline availability

If none of those apply to you, then you should just use OpenAI or Anthropic.

FWIW I think it might be both.

Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.

But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).

Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.

Depends what you need the model to do. The recent granite4.1:3b just takes 2GB of memory and is fast. Results are pretty good and support tool calling. Barely a squeak out of the Mac laptop.

Even faster with the MLX builds.

Then when I need more heavy lifting I fire up a larger model.

IMHO the issue isn't the models. I've had OpenClaw give the same results as Claude using open models locally. Slower but does the job. Something that can do optimal model switching is what's needed.

Yeah it 100% depends what you want the model to do. Some tasks, like extraction, summarization, or simple tool calling (e.g. "turn on my desk lamp") are very doable with tiny models. Others, like coding or more advanced agentic workflows can demand much more powerful models. I was thinking from the lens of coding or running _big_ data extraction pipelines (think ~8 billion pages).
> thers, like coding or more advanced agentic workflows can demand much more powerful models.

You can do coding and agentic fine. For coding I use qwen3.6:35b-mlx and agentic granite4.1:3b works fine.

These are the models I use.

- granite4.1:3b

- granite4.1:30b

- gpt-oss:20b

- gpt-oss:120b (less so now)

- mistral-small3.2

- qwen3.6:35b-mlx

There will always be use cases that don't sit on your laptop, but most of what can be done can be done locally, it just requires a good framework to sit on it.

Why do you like gpt-oss-120b less now? What replaced it?
It's very likely to hallucinate. I'm mostly using Gemma 4 31B now when I need something offline. It is a very strong model for its size.
> DGX Spark-alike is really just asking for trouble. Prefill kills perf.

You're right that prefill kills perf, but shrug the GB10 has far more compute than it has memory bandwidth, so prefill isn't it's bottleneck.

I've seen the same, Sparks are great at non time-sensitive tasks. if you can set up a agentic loop that does not require human intervention, you can design around the memory bandwidth limitations
The other benefit is that speculative decoding literally trades compute to make up for low bandwidth, so MTP/EAGLE/DFlash are unreasonably effective on the GB10 IMO, as long as your use case fits it.

I’m getting 40tk/s decode with 1000+tk/prefill with a 198B-A11B model on mine

I thought MTP wasn't very useful on MoE models because the expert overlap for 2 tokens was too small.
Still helps, and Step 3.5/3.7 were specifically trained for MTP (in a weird triple layer/triple head fashion with a kind of unique architecture)

With the currently-in-PR implementation it doubles decode performance for all the tasks I've been testing it against, at in the worst case is still a 35% uplift, so on a box with heaps of compute and not much memory bandwidth, it's worth it in practice

Qwen 3.6 27B performs similarly to sonnet 4.5 (note I said 4.5, not 4.6) when it comes to coding. It runs amazingly well on my PC with a 7900xtx.

It's worse at general tasks, but in the precise domain of coding I actually prefer to use it over my claude subscription because it has 0 latency (and no privacy concerns whatsoever).

If I could just save up $6000 I could sell off my RTX 5090 for $4,000 and buy an RTX 6000 Blackwell Pro Workstation. I can fit models into the 32GB of vram but my context window ends up being tiny for any halfway capable model.
Isn’t the RTX 6000 Blackwell Pro Workstation over $13000 now?
Dang, that’s crazy. Last I checked they were $10,000. It seemed almost attainable to me as a mere mortal just last year. I’m glad I at least got enough vram and ram to play around a little bit with local models before all the prices went bananas.
And rising. It's depressing.
I feel like the claims come from wildly different personas and use cases. A 24gb vram, 5 year old titan run 27b at 30t/s and the results are good. I use sonnet and opus at my day job and they are more capable but I can still get the same out of qwen, I just need to be mindful of ctx
> Trying to run them on a unified memory Mac

> but still not quite in the realm of Sonnet or DeepSeek 4 Flash

these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4

someone just put this on my radar yesterday, im about to try this today. how's your experience with it?

me thinks there's a lot of optimization strats we're currently leaving on the table just because the amount of things to explore and test are so expansive. but this one is super interesting targeting metal primarily and zeroing in on one model. instead of a one size fits all llama.cpp im very interested to see if theres a future where super tailor-made variants per model pans out to harnesses that can rapidly switch ultimately providing something akin to sonnet/early opus territory (that's my personal bench mark of good-enough i shall now cancel the hell out of this claude sub)

I'm on the verge of cancelling my anthropic $20 plan since it's come out. On an M5 Max 128GB, hooked up to the pi.dev harness, I get in the neighborhood of 400-450tps prefill and 30-35tps generation. It is imminently usable and at times feels more stable than my previous CC setup. Occasionally there are things it struggles with that I will bounce back over to CC for, but it is highly usable. The future is bright for local models! As a tinkerer, it makes me really happy to have a local setup I can be just as productive in, and not have the token overlords ready to shut me down at any time.
That's DS4 Flash right? How does it feel in intelligence and speed compared to DS4 Flash hosted by Deepseek themselves or another API provider? I've been using API DS4 Flash for a lot of personal projects and have been quite impressed. I've spent $1 on building ~10 toy projects and gotten them all to work within the bounds of what I wanted without having to do much besides guide the model away from dumb loops.
I'm using the DS4 flash IQ2 2-bit quant, per Salvadore's recommendations for my hardware in the repo. I haven't messed with the cloud hosted variant. The only other paid API I have messed with is a $20 Anthropic sub, primarily with whatever the latest version of Sonnet is. For the most part, this local configuration feels on par with that.

With this configuration (set up over the last month) I have been working on Python data processing tools, an internal Svelte 5/SvelteKit data intensive BI app, and some smaller Rust projects. It's been doing really well there.

Anybody tried it on Strix Halo?
That RTX6000Pro you mentioned is $12k.
Yep - I'd say either that or 4x 5090 is a great entry point to running local models "well". Two of them would be even better. If you don't have $12-24k to spend, you can try your hand with tiny models or quants or slow speeds, but it will be a much more painful experience. You're already giving up a lot by dropping down from frontier models - you're giving up even more by trying to squeeze them into little RAM and compute.

Prices will fall in the next few years. Maybe just play with the tiny toy models for now to learn how they work, then keep using API providers until they do.

Not really, Qwen 27b offloads to a decent gaming GPU (RTX 4090 in my case) without needing tons of RAM.
can you give more info? llama.cpp vs vllm? config? i wanna try specifically this model
llama.cpp to get 115 tok/s on RTX 4090 with Qwen3.6-27B. For example in Windows the latest CUDA variant llama-b9678-bin-win-cuda-13.3-x64.zip and Unsloth UD-Q4_K_XL MTP gguf:

llama-server.exe --host 0.0.0.0 --alias "Qwen3.6-27B-MTP" -m "F:\Qwen3.6-27B-UD-Q4_K_XL-MTP.gguf" -c 75000 -ngl 99 --metrics --temp 0.6 --top-p 0.95 --min-p 0.00 --top-k 20 --presence-penalty 0.0 --no-mmap -t 16 --spec-type draft-mtp --spec-draft-n-max 3 --reasoning on -fa on --parallel 1 -lv 4

Note that this does not use kv cache quants as in my case quants offload to CPU and tanks performance. Also keep in mind this almost maxes VRAM usage so any additional browsers or other programs that use VRAM should be closed. For chat go to http://localhost:8080/ and minimize the window to maximize perf as the web page UI draw itself consumes a lot of GPU perf via constant context switching.

Can try bigger than -c 75000 until perf gets lower than 100 tok/s - that means something is off as windows starts paging out memory or other issues. -c 50000 seems sweetspot if running browsers and stuff that consume 2GB VRAM. If wanting more than -c 140000 then likely need to use a bit smaller model quant.

CPU usage should be near zero, maybe 1 core load. If you see 8+ core load then settings are off and something is offloaded to CPU (for example kv cache). GPU load should be about 100%, meaning it utilizes work optimally in this case.

-t 16 can be omitted or set to the amount of physical cores, not important in this dense model that is 100% in GPU.

Can be pushed to 125 tok/s with that model if using --spec-draft-n-max 4 but VRAM usage also increases, so context needs to be smaller.

If speed is not important and want max context length then remove the draft-mtp parameters and also might need to use k and v quants like --cache-type-v q8_0, leave k f16 if possible to keep quality.