LMArena isn't very useful as a benchmark, but I can vouch for the fact that GLM 5.1 is astonishingly good. Several people I know who have a $100/mo Claude Code subscription are considering cancelling it and going all in on GLM, because it's finally become (for them) comparable to Opus 4.5/6. I don't use Opus myself, but I can definitely say that the jump from the (imvho) previous best open-weight model, Kimi K2.5, to this is otherworldly, and K2.5 was already a huge jump itself!
Mind you, a 30B model (3B active) is not going to be comparable to Opus. There are open models that are near-SOTA but they are ~750B-1T total params. That's going to require substantial infrastructure if you want to use them agentically, scaled up even further if you expect quick real-time response for at least some fraction of that work. (Your only hope of getting reasonable utilization out of local hardware in single-user or few-users scenarios is to always have something useful cranking in the background during downtime.)
For a business with ten or more engineers/people-using-ai, it might still make sense to set this up. For an individual though, I can’t imagine you’d make it through to positive ROI before the hardware ages out.
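To put rough numbers on the ROI argument, here is a minimal back-of-the-envelope sketch. Only the $100/mo subscription price comes from this thread; the hardware cost and monthly power cost are hypothetical placeholders, and the function name is made up for illustration. It also ignores resale value, maintenance and any quality gap between models.

```python
import math

def breakeven_months(hardware_cost, monthly_subscription, monthly_power_cost, seats=1):
    """Months until avoided subscription fees cover the up-front hardware cost.

    Returns math.inf when the monthly saving is zero or negative,
    i.e. when running locally never pays for itself.
    """
    monthly_saving = seats * monthly_subscription - monthly_power_cost
    if monthly_saving <= 0:
        return math.inf
    return hardware_cost / monthly_saving

# Hypothetical figures: a 40,000 USD server drawing 60 USD of power per month.
solo = breakeven_months(40_000, monthly_subscription=100, monthly_power_cost=60, seats=1)
team = breakeven_months(40_000, monthly_subscription=100, monthly_power_cost=60, seats=10)
print(f"solo: {solo:.0f} months, team of 10: {team:.1f} months")
```

With these made-up inputs, a single user would need decades to break even while a ten-seat team gets there in a few years, which is the shape of the argument above. Plug in your own prices before drawing conclusions.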
It's hard to tell for sure because the local inference engines/frameworks we have today are not really that capable. We have barely started exploring the implications of SSD offload, saving KV-caches to storage for reuse, setting up distributed inference in multi-GPU setups or over the network, making use of specialty hardware such as NPUs etc. All of these can reuse fairly ordinary, run-of-the-mill hardware.
I'm backing up a big dataset onto tapes, so I wanted to automate it. I have an idle 64GB VRAM setup in my basement, so I decided to experiment and tasked it with writing an LTFS implementation. LTFS is an open standard for filesystems on tape, and there's an implementation in C that can be used as the baseline.
So far, over the last 2 days, Qwen 3.6 has produced a functionally equivalent Go implementation that works against the flat-file backend. I'm extremely impressed.
I want to bump this with more than just a +1 by recommending everyone try out OpenCode. It can still run on a Codex subscription, so you aren't in fully unfamiliar territory, but it unlocks a lot of options.
The thing I dislike about OpenCode is how limited its editor is. It's also resource intensive: for some reason, on a VM it chokes every 30 minutes, to the point that I need to discard all sessions, commits, etc.
I don't know if it's Bun related, but in Task Manager Bun is almost always near the top for CPU usage. For me, at least, Bun has turned out not to be production ready at all.
Wish Zed editor had something like BigPickle which is free to use without limits.
Qwen’s 30B models run great on my MBP (M4, 48GB) but the issue I have is cooling - the fan exhaust is straight onto the screen, which I can’t help thinking will eventually degrade it, given the thermal cycling it would go through. A Mac Studio makes far more sense for local inference just for this reason alone.
I have 24GB VRAM available and haven't yet found a decent model or combination.
The last one I tried was Qwen with Continue. I guess I need to spend more time on this.
I'm currently running a custom Gemma4 26B MoE model on my 24GB M2... super fast, and it beat DeepSeek, ChatGPT, and Gemini in 3 different puzzles/code challenges I tested it on. The issue now is the low context... I can only do 2048 tokens with my VRAM... the gap is slowly closing on the frontier models.
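For anyone wondering why context length eats memory so quickly, here is a rough sketch of the usual KV-cache arithmetic. The model dimensions below are hypothetical, not the actual Gemma configuration, and the formula assumes plain attention with a full cache on every layer (sliding-window layers or a quantized cache would need less).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV-cache size: keys plus values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value

def max_context_tokens(free_bytes, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Largest context whose KV cache fits in the memory left after the weights."""
    per_token = kv_cache_bytes(n_layers, n_kv_heads, head_dim, 1, bytes_per_value)
    return free_bytes // per_token

# Hypothetical model: 48 layers, 8 KV heads of dimension 128, fp16 cache.
cache_2k = kv_cache_bytes(48, 8, 128, context_tokens=2048)
fits_in_1gib = max_context_tokens(1024**3, 48, 8, 128)
print(f"2048 tokens: {cache_2k / 1024**2:.0f} MiB, 1 GiB free fits {fits_in_1gib} tokens")
```

The takeaway is that the cache grows linearly with context, so when the weights already fill most of the memory, every extra few thousand tokens has to come out of whatever is left.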
It's a MoE model so I'd assume a cheaper MBP would simply result in some experts staying on CPU? And those would still have a sizeable fraction of the unified memory bandwidth available.
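A rough way to reason about the bandwidth point is the standard upper bound: each generated token has to stream the active weights through memory once, so decode speed is at most bandwidth divided by active bytes per token. The sketch below assumes that simplification. The 30B total / 3B active split is the one mentioned earlier in this thread; the 4-bit quantization and the 100 GB/s bandwidth figure are hypothetical.

```python
def decode_tokens_per_second_upper_bound(active_params, bits_per_weight, bandwidth_gb_per_s):
    """Bandwidth-bound ceiling on decode speed.

    Ignores KV-cache reads, compute time and expert-routing overhead,
    so real throughput will be lower than this.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / bytes_per_token

# Hypothetical: 4-bit weights, 100 GB/s of usable memory bandwidth.
moe = decode_tokens_per_second_upper_bound(3e9, 4, 100)     # 3B active parameters
dense = decode_tokens_per_second_upper_bound(30e9, 4, 100)  # dense 30B for comparison
print(f"MoE ceiling: {moe:.1f} tok/s, dense ceiling: {dense:.1f} tok/s")
```

Under these assumptions the MoE ceiling is ten times the dense one, which is why even a reduced share of the bandwidth can still give usable speeds.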
I haven’t tried this myself yet, but you would still need enough non-VRAM RAM available to the CPU to offload to it, right? This is a complete novice question; I have never tried it.
You're correct. If you don't have enough RAM for the model, it can still run but most of it will run on the CPU and be continuously reloaded from the SSD (through mmap).
A medium MoE like 35B can still achieve usable speeds in that setup, mind you, depending on what you're doing.
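For anyone unfamiliar with the mmap part, here is a tiny sketch of the mechanism using Python's standard mmap module. The file here is a stand-in for a weights file and read_slice is a made-up helper name; real inference engines do this in native code, so treat it as an illustration of the idea rather than how any particular engine is written.

```python
import mmap
import os
import tempfile

def read_slice(path, offset, length):
    """Read a byte range through a read-only memory mapping.

    The OS only pages in the parts of the file that are actually touched,
    and it can evict those pages again under memory pressure.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + length])

# Stand-in "weights" file: 1 MiB of a repeating 0..255 byte pattern.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    path = tmp.name

chunk = read_slice(path, offset=256 * 10 + 5, length=4)
print(list(chunk))
os.remove(path)
```

When the file is bigger than RAM, the same pages get evicted and re-read from the SSD again and again, which is the continuous reloading described above.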