| Out of interest, what machine and model are you running it on? I tried the qwen3.6-27b Q6_K GGUF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either. What sort of speed should I be expecting? I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations. Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?) I'm not expecting it to be instant, but what I'm currently seeing is not really usable. |
There are two models in play here:
- A 27B "dense" model, which uses all of its parameters for every token.
- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.
For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max from 2021 with 32GB of unified memory, and with llama.cpp's default settings it reads (prompt processing) at ~300-500 tokens/sec and writes (generation) at ~30 tokens/sec, which is plenty fast. On the same machine the 27B model reads at ~70 tokens/sec and writes at ~5 tokens/sec.
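To put those speeds in perspective, here's a rough back-of-the-envelope sketch of the total wait for a single request. The 2,000-token prompt and 500-token reply are made-up sizes, 400 tokens/sec is just the midpoint of the read range above, and `response_seconds` is a helper name I invented for illustration.

```python
def response_seconds(prompt_tokens, reply_tokens, read_tps, write_tps):
    """Rough wall-clock time: prompt processing plus token generation."""
    return prompt_tokens / read_tps + reply_tokens / write_tps

# Hypothetical workload sizes, not measurements.
PROMPT_TOKENS, REPLY_TOKENS = 2000, 500

# Speeds taken from the M1 Max figures quoted above.
moe = response_seconds(PROMPT_TOKENS, REPLY_TOKENS, read_tps=400, write_tps=30)
dense = response_seconds(PROMPT_TOKENS, REPLY_TOKENS, read_tps=70, write_tps=5)

print(f"35B-A3B: ~{moe:.0f}s   27B dense: ~{dense:.0f}s")
```

On those assumptions the MoE model answers in about 20 seconds, while the dense model takes a couple of minutes, most of it spent generating.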
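The 35B MoE model technically takes slightly more memory, because all 35B parameters still have to be loaded, but it is much faster because it's doing roughly 1/9th the work per token (3B active parameters versus 27B). It's not quite as "smart", but it's comparable.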
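If you want to see where the "more memory, less work" trade-off comes from, here's a minimal sketch of the arithmetic. The 4.85 bits per weight is my approximate figure for Q4_K_M, not an exact one, and the estimate ignores the KV cache and other runtime overhead, so treat the results as ballpark only.

```python
def weights_gb(total_params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (no KV cache, no overhead)."""
    return total_params_billions * bits_per_weight / 8

# Assumption: approximate average bits per weight for a Q4_K_M quantization.
BITS_PER_WEIGHT = 4.85

dense_total, dense_active = 27, 27   # dense: every parameter is used per token
moe_total, moe_active = 35, 3        # MoE: all 35B loaded, only ~3B used per token

dense_gb = weights_gb(dense_total, BITS_PER_WEIGHT)
moe_gb = weights_gb(moe_total, BITS_PER_WEIGHT)
work_ratio = dense_active / moe_active

print(f"27B dense: ~{dense_gb:.1f} GB of weights")
print(f"35B MoE:   ~{moe_gb:.1f} GB of weights")
print(f"Dense model touches ~{work_ratio:.0f}x more parameters per token")
```

Memory scales with total parameters, while per-token work scales with active parameters. That's why the MoE model can be larger on disk and still generate several times faster. Real-world speedups won't match the parameter ratio exactly.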