Hacker News new | ask | show | jobs
by the_pwner224 71 days ago
I have a 128 GB Strix Halo tablet (same as the other commenter here with the Framework Desktop). I'm using the larger Gemma 4 26B-A4B model (only 28 GB @ Q8) and it's been working great and runs very fast.

It's a 100% replacement for free ChatGPT/Gemini.

Compared to the paid pro/thinking models... Gemma does have reasoning, and I have used the reasoning mode for some tax & legal/accounting advice recently as well as other misc problems. It's worked well for that, but I haven't tried any real difficult tasks. From what I've heard re. agentic coding, the open weight models are ~18-24 months behind Anthropic & Google's SOTA.

Qwen 3.5 122B-A10B should just fit into 128 GB with a Q4/5 and may be a bit smarter. There's apparently also a similar sized Gemma 4 model but they haven't released it yet, the 26B was the largest released.

2 comments

There's a 31B dense model in the Gemma 4 series that's obviously going to be smarter (though a whole lot slower) than the MoE 26A4B.
I tried it and it was unusably slow at ~5-6 TPS. 26A4B gets close to 40 TPS which is faster than you can read, and still pretty quick with reasoning enabled.
Sure, 26B models on beefy desktop silicon are finally nipping at the heels of commercial APIs, but this is a mobile thread. On a phone with 8GB of RAM and passive cooling, your tokens per second (t/s) are going to fall off a cliff after the first minute of sustained compute