| I want to echo this. I've been on claude's opus 4.5/6/7 for work for a couple months, and I finally got back to running Qwen A3B 35B... it's incredibly performant and quite capable on semi-reasonable local hardware. I get ~150 tokens/s on dual nvidia RTX 3090s and can fit the whole 300k context into gpu on a UD-Q4-K-XL quant gguf. Combined with Pi as a harness, and I'm surprised to find that it feels about as capable as claude did 8 months ago (their 3.x models). It's not Opus 4.5 levels yet, but it's good enough for a LOT of basic work. I actually downgraded my personal anthropic subscription because Qwen is absolutely fine for implementation work. I still let a better model write a plan, but then I can just switch over to Qwen to implement. I don't think we're 10 years away from opus 4.5 levels running on cheap consumer hardware. I think we're probably closer to 18 months away, and I suspect it'll be in the 30-60b range, not the 256b range. PC manufacturers also seem to be betting on local, with a LOT of focus on 64 to 128gb unified RAM machines. |
I am a fully-burned-out freelancer (in the last couple of years so severely and totally that I thought I had early onset dementia, and I am still not sure I don't). I don't really have an off-ramp to anything else yet, but the sea-change in the industry has been contributing to my feeling that I should knock it on the head.
I must get past broad understanding of AI to deep understanding, but I have to find a way to do this which sits well with freelancer ethics (sustainability, stability, control of destiny).
So I decided I would start out with that operating principle that ultimately this stuff is just going to be local: models will eventually hit some level of practicality for most tasks and technological progress guarantees that they will eventually run on desktops.
I decided to learn how to run models locally properly, see how far I get with opencode (and Pi and Zed experiments), and grow outwards from there to metered models (opencode go, openrouter etc.)
Knowledge first; what can I do that meaningfully changes my outcomes and confidence with no cost and no exposure to sudden change?
I have a secondhand M1 Max (excellent GPU bandwidth), and I am really shocked to find that arguably that level of practicality is already here.
Qwen 3.6 35B can really do a lot. And — not sure if you have tested it — but in some ways I think the Gemma 4 26B is better. Particularly for more commonplace dev tech — it is very knowledgeable about the sort of low-end web dev stack that is most common (Wordpress, PHP, MySQL).
I have been getting 75 tokens/sec with (GGUF) Gemma-4 26B QAT and MTP. (Can't get anywhere close with MLX, for some reason.)
A similar sort of speed with an MLX Qwen 3.6 35B. I have a sneaking suspicion that maybe llama.cpp is now faster than MLX on this older kit so I might try seeing what llama.cpp can do there, too.
Not blazing fast, but fast enough that there are plenty of experiments and small jobs I can do before I even get to using Big Pickle!