| HN Mirror

logicprog 102 days ago

LMArena isn't very useful as a benchmark; however, I can vouch for the fact that GLM 5.1 is astonishingly good. Several people I know who have a $100/mo Claude Code subscription are considering cancelling it and going all in on GLM, because it has finally become (for them) comparable to Opus 4.5/6. I don't use Opus myself, but I can definitely say that the jump from the (imvho) previous best open-weight model, Kimi K2.5, to this is otherworldly, and K2.5 was already a huge jump itself!

blahblaher 102 days ago

qwen3.5/3.6 (30B) works well, locally, with opencode

Mind you, a 30B model (3B active) is not going to be comparable to Opus. There are open models that are near-SOTA but they are ~750B-1T total params. That's going to require substantial infrastructure if you want to use them agentically, scaled up even further if you expect quick real-time response for at least some fraction of that work. (Your only hope of getting reasonable utilization out of local hardware in single-user or few-users scenarios is to always have something useful cranking in the background during downtime.)

For a business with ten or more engineers/people-using-ai, it might still make sense to set this up. For an individual though, I can’t imagine you’d make it through to positive ROI before the hardware ages out.
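The ROI argument above can be put into rough numbers. This is only a back-of-the-envelope sketch: the $100/mo fee is the subscription price quoted upthread, while the $30,000 hardware budget and the `breakeven_months` helper are made-up placeholders, not figures from anyone in this thread.

```python
import math


def breakeven_months(hardware_cost: float, monthly_fee_per_user: float,
                     users: int, monthly_running_cost: float = 0.0) -> float:
    """Months until avoided subscription fees cover the hardware outlay."""
    monthly_savings = monthly_fee_per_user * users - monthly_running_cost
    if monthly_savings <= 0:
        return math.inf  # the hardware never pays for itself
    return hardware_cost / monthly_savings


# Assumption: $30,000 is a hypothetical hardware budget, chosen only for illustration.
solo = breakeven_months(30_000, 100, users=1)
team = breakeven_months(30_000, 100, users=10)
print(f"solo: {solo:.0f} months, team of 10: {team:.0f} months")
# prints: solo: 300 months, team of 10: 30 months
```

With these placeholder numbers a single user would need 25 years to break even, while a team of ten gets there in two and a half, which is the shape of the argument being made.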

It's hard to tell for sure because the local inference engines/frameworks we have today are not really that capable. We have barely started exploring the implications of SSD offload, saving KV-caches to storage for reuse, setting up distributed inference in multi-GPU setups or over the network, making use of specialty hardware such as NPUs etc. All of these can reuse fairly ordinary, run-of-the-mill hardware.

DeathArrow 102 days ago

Since you need at least a few H100-class cards, I guess you need at least a few tens of coders to justify the costs.

I see the 512GB Mac Studios aren’t for sale anymore, but that was a much cheaper path.

cyberax 102 days ago

I'm backing up a big dataset onto tapes, so I wanted to automate it. I have an idle 64GB VRAM setup in my basement, so I decided to experiment and tasked it with writing an LTFS implementation. LTFS is an open standard for filesystems for tapes, and there's an implementation in C that can be used as the baseline.

Within the last 2 days, Qwen 3.6 has created a functionally equivalent Golang implementation that works against the flat-file backend. I'm extremely impressed.

Gareth321 102 days ago

It is surprisingly competent. It's not Opus 4.6 but it works well for well-structured tasks.

wuschel 102 days ago

What near SOTA open models are you referring to?

I want to bump this more than just a +1 by recommending everyone try out OpenCode. It can still run on a Codex subscription so you aren’t in fully unfamiliar territory but unlocks a lot of options.

The Codex TUI harness is also open source and you can use open models with it, so you can stay in even more familiar territory.

pwython 102 days ago

pi-coding-agent (pi.dev) is also great. I've been using it with Gemma 4 and Qwen 3.6.

equasar 102 days ago

The thing I dislike about OpenCode is the lack of capabilities of their editor. It is also resource intensive: for some reason, on a VM it chokes every 30 mins, to the point that I need to discard all sessions, commits, etc.

I don't know if it is bun related, but in task manager it is almost always at the top for CPU usage. Turns out for me, bun is not production ready at all.

Wish Zed editor had something like BigPickle which is free to use without limits.

Jarred 102 days ago

> turns out for me, bun is not production ready

What issue did you run into?

jherdman 102 days ago

Is this sort of setup tenable on a consumer MBP or similar?

danw1979 102 days ago

Qwen’s 30B models run great on my MBP (M4, 48GB) but the issue I have is cooling - the fan exhaust is straight onto the screen, which I can’t help thinking will eventually degrade it, given the thermal cycling it would go through. A Mac Studio makes far more sense for local inference just for this reason alone.

For a 30B model, you want at least 20GB of VRAM and a 24GB MBP can’t quite allocate that much of it to VRAM. So you’d want at least a 32GB MBP.
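A rough way to see where a figure like 20GB could come from (the comment does not say how it was derived) is to estimate the size of the weights alone. In this sketch, `weight_gb` is a hypothetical helper, and 4.5 bits per weight is an assumed average for a 4-bit quantization that also stores scaling factors; real quantization formats vary.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


# Assumption: 4.5 bits/weight stands in for a typical 4-bit quantization.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("~4-bit (4.5 assumed)", 4.5)]:
    print(f"30B at {label}: {weight_gb(30, bits):.1f} GB")
```

Under that assumption the weights come to roughly 17 GB, and the KV cache and runtime buffers sit on top of that, so "at least 20GB" is plausible. The same arithmetic puts a 1T-parameter model at about 560 GB at the same bit width.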

richardfey 102 days ago

I have 24GB VRAM available and haven't yet found a decent model or combination. The last one I tried was Qwen with Continue; I guess I need to spend more time on this.

_blk 102 days ago

Is there any model that practically compares to Sonnet 4.6 in code and vision and runs on home-grade (12G-24G) cards?

macwhisperer 102 days ago

I'm currently running a custom Gemma4 26B MoE model on my 24GB M2... super fast, and it beat DeepSeek, ChatGPT, and Gemini in 3 different puzzles/code challenges I tested it on. The issue now is the low context... I can only do 2048 tokens with my VRAM... the gap is slowly closing on the frontier models.
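The context limit mentioned here is usually a KV-cache budget problem: the cache grows linearly with the number of tokens. A minimal sketch of the standard estimate follows. The layer count, KV-head count, and head dimension are hypothetical values picked for illustration, not the real configuration of any Gemma model, and models that use sliding-window or other attention variants will come out differently.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


# Assumption: hypothetical architecture (48 layers, 8 KV heads, head_dim 128, fp16 cache).
for tokens in (2048, 32768):
    size = kv_cache_bytes(tokens, n_layers=48, n_kv_heads=8, head_dim=128)
    print(f"{tokens} tokens: {size / 2**20:.0f} MiB")
# prints: 2048 tokens: 384 MiB
# prints: 32768 tokens: 6144 MiB
```

With these made-up dimensions, going from 2048 to 32768 tokens costs about 6 GiB, which is why context is the first thing to go when the weights already fill most of the memory.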

It's a MoE model so I'd assume a cheaper MBP would simply result in some experts staying on CPU? And those would still have a sizeable fraction of the unified memory bandwidth available.

I haven’t tried this myself yet, but you would still need enough non-VRAM RAM available to the CPU to offload to it, right? This is a complete novice question; I have never tried it.

tredre3 102 days ago

You're correct. If you don't have enough RAM for the model, it can still run but most of it will run on the CPU and be continuously reloaded from the SSD (through mmap).
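For anyone unfamiliar with the mmap behavior described above, here is a small illustration of the mechanism itself, not of any particular inference engine. A file is mapped into memory, and the OS only reads pages from disk when they are touched; under memory pressure it can drop them and read them again later. The file `weights.bin` is a hypothetical stand-in for a model file.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Hypothetical stand-in for a model file: 256 pages, each filled with its own index.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    for i in range(256):
        f.write(bytes([i]) * PAGE)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    mapped_size = len(mm)  # the whole file is addressable without being read up front
    chunk = mm[200 * PAGE : 200 * PAGE + 4]  # touching page 200 makes the OS page it in

print(mapped_size // PAGE, chunk)
os.remove(path)
```

When the mapped file is larger than free RAM, pages that were evicted have to be fetched from the SSD again on the next access, which is the "continuously reloaded" part of the comment.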

A medium MoE like 35B can still achieve usable speeds in that setup, mind you, depending on what you're doing.
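A crude way to reason about why an MoE stays usable: decoding is largely bound by how many bytes of weights must be read per token, and an MoE only reads its active parameters. The sketch below is an upper-bound estimate under loud assumptions: 3B active parameters (the figure quoted upthread for the 30B model), 4.5 bits per weight, and hypothetical bandwidths of 100 GB/s for memory and 3 GB/s for an SSD. It ignores attention, KV-cache reads, and caching of frequently used experts.

```python
def max_tokens_per_second(active_params_billion: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# Assumption: bandwidth figures are placeholders, not measurements of any machine.
moe_in_ram = max_tokens_per_second(3, 4.5, bandwidth_gb_s=100)
moe_from_ssd = max_tokens_per_second(3, 4.5, bandwidth_gb_s=3)
dense_in_ram = max_tokens_per_second(30, 4.5, bandwidth_gb_s=100)
print(f"MoE from RAM: {moe_in_ram:.1f} tok/s")
print(f"MoE from SSD: {moe_from_ssd:.1f} tok/s")
print(f"dense 30B from RAM: {dense_in_ram:.1f} tok/s")
```

Real setups land somewhere between the RAM and SSD figures, depending on how much of the model stays resident in memory.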

Gareth321 102 days ago

The Mac Minis (probably 64GB RAM) are the most cost effective.

cpursley 102 days ago

How are you running it with opencode? Any tips/pointers on the setup?

cmrdporcupine 102 days ago

GLM 5.1 via an infra provider. Running a competent coding-capable model yourself isn't viable unless your standards are quite low.