by horsawlarway 3 days ago
I want to echo this.

I've been on claude's opus 4.5/6/7 for work for a couple months, and I finally got back to running Qwen A3B 35B... it's incredibly performant and quite capable on semi-reasonable local hardware.

I get ~150 tokens/s on dual nvidia RTX 3090s and can fit the whole 300k context into gpu on a UD-Q4-K-XL quant gguf.

Combined with Pi as a harness, I'm surprised to find that it feels about as capable as claude did 8 months ago (their 3.x models).

It's not Opus 4.5 levels yet, but it's good enough for a LOT of basic work. I actually downgraded my personal anthropic subscription because Qwen is absolutely fine for implementation work. I still let a better model write a plan, but then I can just switch over to Qwen to implement.

I don't think we're 10 years away from opus 4.5 levels running on cheap consumer hardware. I think we're probably closer to 18 months away, and I suspect it'll be in the 30-60b range, not the 256b range.

PC manufacturers also seem to be betting on local, with a LOT of focus on 64 to 128gb unified RAM machines.


I have come at this at a slightly different angle.

I am a fully-burned-out freelancer (in the last couple of years so severely and totally that I thought I had early onset dementia, and I am still not sure I don't). I don't really have an off-ramp to anything else yet, but the sea-change in the industry has been contributing to my feeling that I should knock it on the head.

I must get past broad understanding of AI to deep understanding, but I have to find a way to do this which sits well with freelancer ethics (sustainability, stability, control of destiny).

So I decided I would start out with the operating principle that ultimately this stuff is just going to be local: models will eventually hit some level of practicality for most tasks and technological progress guarantees that they will eventually run on desktops.

I decided to learn how to run models locally properly, see how far I get with opencode (and Pi and Zed experiments), and grow outwards from there to metered models (opencode go, openrouter etc.)

Knowledge first; what can I do that meaningfully changes my outcomes and confidence with no cost and no exposure to sudden change?

I have a secondhand M1 Max (excellent GPU bandwidth), and I am really shocked to find that arguably that level of practicality is already here.

Qwen 3.6 35B can really do a lot. And — not sure if you have tested it — but in some ways I think the Gemma 4 26B is better. Particularly for more commonplace dev tech — it is very knowledgeable about the sort of low-end web dev stack that is most common (Wordpress, PHP, MySQL).

I have been getting 75 tokens/sec with (GGUF) Gemma-4 26B QAT and MTP. (Can't get anywhere close with MLX, for some reason.)

A similar sort of speed with an MLX Qwen 3.6 35B. I have a sneaking suspicion that maybe llama.cpp is now faster than MLX on this older kit so I might try seeing what llama.cpp can do there, too.

Not blazing fast, but fast enough that there are plenty of experiments and small jobs I can do before I even get to using Big Pickle!

How are you running that GGUF, and how many tokens/sec are you getting without MTP? My M1 Max gives me 65 t/s for non-MTP unsloth/gemma-4-26B-A4B-it-qat-GGUF (UD-Q4_K_XL), but with MTP that actually goes down to 56 t/s (at 63% accepted drafts).
Just this guy's assistant running against the official Q4_0 GGUF:

  ./llama-server \
    -hf google/gemma-4-26B-A4B-it-qat-q4_0-gguf \
    --spec-draft-hf RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf:Q4_0 \
    --spec-type draft-mtp \
    --spec-draft-n-max 3
I hadn't done any really radical testing so I've just had another look.

Without the MTP drafter, it is pretty consistently 75 tokens per second anyway, which is interesting.

With the MTP drafter it reaches well above 95 tokens per second handling the prompt and it will slowly drop to 65 or so with the output tokens as the prediction success rate slowly drops.

But with generated output it seems to me that the predictions are always going to drop dramatically over time.

I think my results here are broadly consistent with what people say about success rates with smaller and sparse models. I am going to test with n-max 4 in agentic situations at some point, and I may see whether it has much impact on the 31B model which is too slow to be practical otherwise.
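
A rough way to reason about these numbers is the usual back-of-envelope estimate for speculative decoding. This is only a sketch: it assumes every drafted token is accepted independently with the same probability (real acceptance drifts over a generation, as described above), and that verifying a batch of drafted tokens costs the same as one normal decode step, which may not hold for sparse models. `draft_cost` is a hypothetical parameter, the cost of one draft token relative to one full-model step, not something measured here.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per full-model verification pass.

    Assumes each drafted token is accepted independently with
    probability `accept_rate` (a simplification).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if accept_rate == 1.0:
        # Every draft accepted, plus the one token the full model adds.
        return float(n_draft + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n_draft
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, n_draft: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    `draft_cost` is hypothetical: the cost of drafting one token,
    expressed as a fraction of one full-model decode step.
    """
    step_cost = 1.0 + n_draft * draft_cost
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost
```

Under these assumptions a result below 1.0 means the drafter costs more than it saves, so sweeping `accept_rate` downward for a fixed `n_draft` gives a feel for where that crossover sits on a given machine.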

I have a very unqualified feeling that MTP will matter more in agentic coding because of the larger prompts.

But my biggest issue since I installed it, I think, is that the combination is occasionally messing with markdown generation during thinking, and sometimes possibly losing the </think> at the end. I've seen it enough now to be fairly sure it is the Gemma MTP causing it. There is an open bug in the vLLM project about this and I wonder if something similar is going on in llama.cpp.

The speed without the MTP drafter is pretty solid so I am content to let more experienced people than me handle things while I learn other stuff, but I might go looking for some testing code that can prove it sometime.
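
For what it's worth, the testing code could start as small as the sketch below. It assumes the raw model output is available as a string with literal `<think>` and `</think>` markers still in it; some servers strip or rewrite these, so that is an assumption to check first. The function names are made up for illustration.

```python
from collections import Counter
from typing import Iterable

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def think_block_status(text: str) -> str:
    """Classify one raw completion by how its thinking markers pair up."""
    opens = text.count(THINK_OPEN)
    closes = text.count(THINK_CLOSE)
    if opens == 0 and closes == 0:
        return "no_thinking"
    if opens == closes:
        return "balanced"
    if opens > closes:
        # The case described above: thinking starts but never closes.
        return "missing_close"
    return "missing_open"


def summarize(outputs: Iterable[str]) -> Counter:
    """Tally statuses over a batch of completions."""
    return Counter(think_block_status(o) for o in outputs)
```

Running the same prompts with and without the drafter and comparing the two `summarize` tallies would show whether `missing_close` really shows up more often with MTP enabled.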

Just saw this:

https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-gg...

Might see if Google has official drafters later.

The majority of my agentic setup is pi / Claude Code, where every single Chinese model falls short, except the commercial 1T models.

Local is a pipe dream. If you can run it cheaply on occasion, why can’t commercial companies run it cheaper 24/7 and lower the costs? The answer is simple: use cases are getting more demanding, and hence you need more from the model, not less.

Sure, if your task is narrow labeling on 1M records, a small optimized model is good. If you want to do complex things, it shifts with model advancements.

This sounds like something someone at IBM in 1986 would say trying to sell their mainframes. "PCs will never be a thing. No one's gonna want a computer."

I'm seeing some impressive results from folks that can afford 10k+ GPUs right now. But those GPUs will all be hand me downs in 10 years. So pipe dream? Hmmm...... that's not how this industry works.

Those are not GPUs available on iPhones. Will we get there eventually? Maybe! Maybe we end up with GPU clusters built on the edge (e.g. cell towers) for offloading, maybe it’s never economical, maybe a different model architecture makes it simpler, who knows.

But it doesn’t seem anywhere imminent with our current world state.

My computer is 15,000 times faster and costs in inflation adjusted dollars half that of my computer in 1995. There's zero reason to think that won't happen over the next 30 years again.
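
To put that figure in terms of a growth rate, here is a small sketch. It assumes smooth exponential growth over the whole period, which is a simplification, and the function names are just illustrative.

```python
import math


def doubling_time_years(total_factor: float, years: float) -> float:
    """Years per doubling implied by a total speedup over a period.

    Assumes steady exponential growth across the whole period.
    """
    if total_factor <= 1.0 or years <= 0:
        raise ValueError("need total_factor > 1 and years > 0")
    return years / math.log2(total_factor)


def annual_growth_rate(total_factor: float, years: float) -> float:
    """Compound yearly improvement (0.25 means 25% per year)."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("need total_factor > 0 and years > 0")
    return total_factor ** (1.0 / years) - 1.0
```

Calling `doubling_time_years(15_000, 30)` works out to a doubling roughly every 2.2 years, and the same function can be pointed at any other "N times faster in Y years" claim to compare them on equal footing.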

For whatever reason every generation thinks they are the peak. Naw man. You're just a blip at the bottom of the logarithmic chart.

For me there are a bunch of questions:

- was the pause in model scaling a result of the benefits of RL & SFT being easier to access and quicker than scaling, or was it genuinely the result of scaling being low ROI now?

- are the power densities necessary to provide high quality on-device inference possible? Can the best technically feasible architectures accommodate T-scale models and run them off batteries that fit in your hand?

- will things slow down enough to allow edge deployments to realise value vs. centralised deployments?

- do edge use cases drive enough revenue to get this to happen?

- can local inference make up for model scale? Does that make sense in a latency/power race with the central infrastructure? Is there a sweet spot here?

I am not sure about any of the answers...

It has slowed down massively for CPUs at least, e.g. modern CPUs are hardly more than 3-5x faster than those from 10 years ago. There is zero reason to think that won’t happen over the next 10 years again.

This isn't a crazy statement (cpu performance metrics have mostly stalled their meteoric rise from prior to the 2000s).

But it also doesn't capture the entire picture.

CPU metrics mostly stalled for two reasons.

1. There wasn't much demand for the extra capacity. Even low end cpus from a decade ago are plenty capable for just browsing the web and typing up documents. It takes a novel use-case to drive demand again (or a desire to do things like play new games).

2. The interest in CPU development shifted in response to mobile. Given point #1 and the state of battery development.... the blocker wasn't "performance". It was "performance per watt". And on that metric you couldn't be more wrong.

Since ~2005, MIPS per watt has improved 15x to 30x.

Also - fun news is that the traditional CPU pipeline really isn't the bottleneck for AI workloads. So we're going to see incredible interest in things like memory bandwidth and other inference related hardware bottlenecks, which haven't already been optimized.

Because I have a fixed expenditure on my local machine, and I can be absolutely sure of the costs over a long horizon (5+ years, for low end hardware life, 10+ years with moderate care). Not something that's true for cloud costs.

Your argument is actually really similar to an argument around the time Uber started kicking into gear and expanding.

It went:

---

"Why should I own a car when it's actually cheaper to just Uber for all my rides, compared to the cost of buying, maintaining, and insuring a car?"

---

And that wasn't an insane argument at that exact moment. Uber was pricing itself in the range of $5-$7 a ride, was novel and high quality.

Except take a look around today... Uber in my area went from ~$5 a ride to ~$27 a ride for the same trip. Uber's quality has also degraded quite a bit. It went from primarily high end, new cars with immaculately clean interiors to "average".

So want to make a wager on what's going to happen with cloud costs over the next decade for inference?

Because my strong hunch is they're going to follow exactly the same trend. They will stop being subsidized, providers WILL downgrade model quality to improve operating costs (and you'll have no control over this outside of enterprise contracts), and companies will start exploring "additional revenue options"... which means they'll shove ads and sponsored content into your results.

Is it worth being ~10-18 months behind the latest and greatest to avoid that entire set of shenanigans? I'd vote yes... I pay one time up front, and get usage limited by my hardware for the cost of electricity over a 10 year timeline. That's a decent deal with no surprises.
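
The comparison I'm making can be sketched as a tiny calculator. Every input is something you'd have to fill in yourself; hardware price, power draw, electricity rate, subscription fee and yearly price increase are all hypothetical parameters here, not figures from my setup.

```python
def local_total_cost(
    hardware_cost: float,
    watts: float,
    hours_per_day: float,
    price_per_kwh: float,
    years: int,
) -> float:
    """One-time hardware purchase plus electricity over the period."""
    energy_kwh = watts / 1000.0 * hours_per_day * 365.0 * years
    return hardware_cost + energy_kwh * price_per_kwh


def subscription_total_cost(
    monthly_fee: float,
    years: int,
    annual_increase: float = 0.0,
) -> float:
    """Total subscription spend, with the fee rising once per year.

    `annual_increase` is a guess about future pricing (0.10 = 10%/year).
    """
    total = 0.0
    fee = monthly_fee
    for _ in range(years):
        total += fee * 12
        fee *= 1.0 + annual_increase
    return total
```

The point of the second function is the `annual_increase` knob: the local side is known up front, while the rented side depends on a number nobody outside the provider controls.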

You're welcome to rent, but renting makes you subject to the whims of the owners. They're being very nice right now to attract all the flies. That's not a mistake, and it's absolutely a trap.

---

Side note - if you're only able to do labeling tasks with a local model... you're holding something very, very wrong.

Keep working on your agentfu because there is a sweet spot with subagents and parallelizable plans. It’s not about better, it’s about efficiency and picking the right model for the job. You can achieve the same results as frontier models with the right type of planning and context management on local Chinese models.

Depends on what you're doing, of course, but for the small and focused tasks where I'm using agentic AI, local models on my M1 Mac Studio are superb.

I was freaked out being stuck with OpenAI and Anthropic. I set up qwen3.6:35b-mlx on my Mac Studio M1 Ultra and was really blown away. I am no longer afraid that Anthropic or OpenAI will be able to control the market.