by gcr
22 days ago
There are two flavors of Qwen 3.6:

- A 27B "dense" model
- A 35B "Mixture of Experts" (MoE) model, which activates only 3B parameters for each token.

For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have a 2021 M1 Max with 32GB of unified memory that can read (process the prompt) at ~300-500 tokens/sec and write (generate) at ~30 tokens/sec with llama-cpp's default settings, which is plenty fast. On the same machine, the 27B model reads at ~70 tokens/sec and writes at ~5 tokens/sec.

The 35B MoE model technically takes slightly more memory but is much faster because it's doing 1/9th the work per token (3B active parameters versus 27B). It's not quite as "smart", but it's comparable.
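For anyone who wants to check the "1/9th the work" figure, here is a rough back-of-the-envelope sketch in Python. It only uses the numbers quoted in this comment (one machine, llama-cpp defaults); the variable names are made up for illustration, and the speeds are anecdotal rather than a benchmark.

```python
# Rough arithmetic on the figures quoted above. These are one person's
# observations on a single M1 Max, not a controlled benchmark.

DENSE_ACTIVE_B = 27  # dense model: all 27B parameters are used for every token
MOE_ACTIVE_B = 3     # MoE model: only ~3B parameters are activated per token

# Ratio of per-token work (dense vs. MoE), judged by active parameters alone.
work_ratio = DENSE_ACTIVE_B / MOE_ACTIVE_B

# Observed generation ("write") speeds in tokens/sec.
write_speedup = 30 / 5

# Observed prompt-processing ("read") speeds in tokens/sec (MoE range vs. dense).
read_speedup_low = 300 / 70
read_speedup_high = 500 / 70

print(f"Active-parameter ratio: {work_ratio:.0f}x")
print(f"Observed write speedup: {write_speedup:.0f}x")
print(f"Observed read speedup: {read_speedup_low:.1f}x to {read_speedup_high:.1f}x")
```

The observed speedups (roughly 4x to 7x) come in below the 9x parameter ratio, which fits the point that the MoE model is much faster without being a literal 9x faster in practice.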
|
I recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.