by hadlock
3 days ago
Qwen's ~30B-class models are genuinely good enough for real use if you can find a machine with enough memory bandwidth to run them at 30-90 tokens/second. It's telling that Qwen stopped releasing 120B-class models. At some point in the next 10 years (maybe 3?) someone is going to release an Opus 4.5-class 256B model you can run locally. Right now our engineers use about $800/mo worth of Opus tokens; at that rate the ROI for a local LLM setup is ~10 months.
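For anyone who wants to plug in their own numbers, the payback math above can be sketched roughly like this. The $8,000 hardware figure is not stated in the comment; it is only what $800/mo and ~10 months imply together. The function names are made up for illustration, and the sketch ignores electricity, maintenance, and any API spend that continues after the switch.

```python
def payback_months(hardware_cost_usd: float, monthly_api_spend_usd: float) -> float:
    """Months until avoided API spend equals the up-front hardware cost."""
    if monthly_api_spend_usd <= 0:
        raise ValueError("monthly_api_spend_usd must be positive")
    return hardware_cost_usd / monthly_api_spend_usd


def implied_hardware_budget(monthly_api_spend_usd: float, months: float) -> float:
    """Hardware budget implied by a given spend and payback period."""
    return monthly_api_spend_usd * months


if __name__ == "__main__":
    # Assumption: the hardware budget is inferred from the comment, not quoted from it.
    budget = implied_hardware_budget(800, 10)
    print(f"implied hardware budget: ${budget:,.0f}")
    print(f"payback: {payback_months(budget, 800):.1f} months")
```

If the local box only replaces part of the Opus usage, pass the replaced portion as the monthly spend and the payback period stretches accordingly.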
I've been on Claude's Opus 4.5/6/7 for work for a couple of months, and I finally got back to running Qwen A3B 35B... it's incredibly performant and quite capable on semi-reasonable local hardware.
I get ~150 tokens/s on dual NVIDIA RTX 3090s and can fit the whole 300k context in GPU memory with a UD-Q4_K_XL quant GGUF.
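A rough back-of-envelope for why that fits, as a sketch only: the ~4.5 bits per weight for this quant and the 2 GiB runtime overhead are my assumptions, not measured values, and the helper names are hypothetical. Each RTX 3090 has 24 GB of VRAM, which the sketch treats as 24 GiB for simplicity. What the KV cache actually needs for 300k tokens depends on the model's attention layout and cache quantization, so this only estimates the room left over after the weights.

```python
GIB = 2**30


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_budget_gib(total_vram_gib: float, weights_gib: float,
                  overhead_gib: float = 2.0) -> float:
    """VRAM left for KV cache after weights and an assumed runtime overhead."""
    return total_vram_gib - weights_gib - overhead_gib


if __name__ == "__main__":
    # Assumptions: 35B total parameters, ~4.5 bits/weight, 2 x 24 GiB cards.
    weights = weight_gib(35, 4.5)
    budget = kv_budget_gib(2 * 24, weights)
    print(f"weights: ~{weights:.1f} GiB, left for KV cache: ~{budget:.1f} GiB")
```

Under those assumptions the weights take well under half of the combined VRAM, which is what leaves space for a long context.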
With Pi as a harness, I'm surprised to find that it feels about as capable as Claude did 8 months ago (their 3.x models).
It's not at Opus 4.5 level yet, but it's good enough for a LOT of basic work. I actually downgraded my personal Anthropic subscription because Qwen is absolutely fine for implementation work. I still let a better model write the plan, but then I can just switch over to Qwen to implement it.
I don't think we're 10 years away from Opus 4.5-level models running on cheap consumer hardware. I think we're probably closer to 18 months away, and I suspect it'll be in the 30-60B range, not the 256B range.
PC manufacturers also seem to be betting on local, with a LOT of focus on machines with 64 to 128 GB of unified RAM.