by d4rkp4ttern
77 days ago
For token-generation speed, a challenging test is to see how a model performs in a code-agent harness like Claude Code, which starts with anywhere between 15K and 40K tokens of context from the system prompt alone (plus tools, skills, etc.). Here the 26B-A4B variant is head and shoulders above recent open-weight models, at least on my trusty M1 Max 64GB MacBook.

I set up Claude Code to use this variant via llama-server, with 37K tokens of initial context, and it performs very well: ~40 tokens/sec, far better than Qwen3.5-35B-A3B, though I don't yet know about its intelligence or tool-calling consistency. Prompt processing speed is comparable to the Qwen variant at ~400 tok/s.

My informal tests, all with roughly 30K-37K tokens of initial context (tg = token generation):

┌────────────────────┬───────────────┬────────────┐
│ Model              │ Active Params │ tg (tok/s) │
├────────────────────┼───────────────┼────────────┤
│ Gemma-4-26B-A4B    │ 4B            │ ~40        │
├────────────────────┼───────────────┼────────────┤
│ GPT-OSS-20B        │ 3.6B          │ ~17-38     │
├────────────────────┼───────────────┼────────────┤
│ Qwen3-30B-A3B      │ 3B            │ ~15-27     │
├────────────────────┼───────────────┼────────────┤
│ GLM-4.7-Flash      │ 3B            │ ~12-13     │
├────────────────────┼───────────────┼────────────┤
│ Qwen3.5-35B-A3B    │ 3B            │ ~12        │
├────────────────────┼───────────────┼────────────┤
│ Qwen3-Next-80B-A3B │ 3B            │ ~3-5       │
└────────────────────┴───────────────┴────────────┘
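
To put those rates in per-turn terms, here is a minimal back-of-the-envelope sketch, not a benchmark. It assumes a turn costs prompt processing plus generation and ignores any KV-cache reuse between turns, so it is closest to the first turn of a session. The function name `turn_latency_s` and the 500-token reply length are my own illustrative assumptions; the 37K context, ~400 tok/s, and ~40 tok/s figures are the ones quoted above.

```python
def turn_latency_s(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    """Rough wall-clock seconds for one turn: prompt processing + generation.

    Assumes no prompt/KV-cache reuse, so every prompt token is processed.
    """
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s


# 37K-token initial context at ~400 tok/s prompt processing,
# then an assumed 500-token reply at ~40 tok/s generation.
first_turn = turn_latency_s(37_000, 500, pp_tok_s=400, tg_tok_s=40)
print(f"{first_turn:.1f} s")  # 92.5 s prompt + 12.5 s generation = 105.0 s
```

Under these assumptions the initial prompt pass dominates the first turn; the generation rate matters more on later turns if the server can reuse the already-processed context.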
Full instructions for running this and other open-weight models with Claude Code are here: https://pchalasani.github.io/claude-code-tools/integrations/...