by n8henrie
55 days ago
I'm still fairly new to local LLMs. Yesterday I spent some time setting up and testing a few Qwen3.6-35B-A3B models (mlx 4-bit and 8-bit, gguf Q4_K_M and Q4_K_XL, I think) and was impressed at how they ran on my 64 GB M4. It looks like this new model is slightly "smarter" (based on the tables in TFA) but requires more VRAM. Is that it? Is the "dense" part the big deal? Since 27B < 35B, should we expect some quantized models soon that will bring the VRAM requirement down?
|
This model is a "dense" model: all 27B parameters are used for every token, whereas 35B-A3B only activates about 3B per token. It will be much slower on Macs. Concretely, on an M4 Pro, at Q6 gguf, it was ~9 tok/s for me. 35B-A3B (at Q4, with mlx, so not a fair comparison) was ~70 tok/s.
In general, dedicated GPUs tend to do better with these kinds of "dense" models, though this becomes harder to judge when the GPU does not have enough VRAM to keep the model fully resident. For this model, I would expect you'd be fine with >=24GB of VRAM, e.g. an NVIDIA {3,4,5}090-type card.
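
For a rough feel of where those numbers come from, here is a back-of-envelope sketch. It assumes decoding is memory-bandwidth bound, so the ceiling on tok/s is roughly bandwidth divided by the bytes of weights read per token. The 273 GB/s figure (M4 Pro) and the bits-per-weight values are my assumptions, not measurements, and the helper names are made up for illustration. It ignores KV cache, activations, and compute, so real speeds land below these ceilings.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on tok/s if every active weight is read once per token."""
    return bandwidth_gb_s / weight_gb(active_params_billion, bits_per_weight)


# Assumptions (not from the thread): M4 Pro memory bandwidth and
# approximate effective bits per weight for Q6 and Q4 style quants.
BANDWIDTH_GB_S = 273.0
Q6_BITS = 6.56
Q4_BITS = 4.5

# Dense 27B: all weights are read for every token.
dense_q6_gb = weight_gb(27, Q6_BITS)
dense_q4_gb = weight_gb(27, Q4_BITS)
dense_ceiling = decode_ceiling_tok_s(27, Q6_BITS, BANDWIDTH_GB_S)

# 35B-A3B: all 35B must be resident, but only ~3B are read per token.
moe_resident_gb = weight_gb(35, Q4_BITS)
moe_ceiling = decode_ceiling_tok_s(3, Q4_BITS, BANDWIDTH_GB_S)

print(f"dense 27B @ Q6: {dense_q6_gb:.1f} GB, ceiling {dense_ceiling:.1f} tok/s")
print(f"dense 27B @ Q4: {dense_q4_gb:.1f} GB")
print(f"35B-A3B  @ Q4: {moe_resident_gb:.1f} GB resident, ceiling {moe_ceiling:.1f} tok/s")
```

Under these assumptions the dense model at Q6 is about 22 GB with a ceiling near 12 tok/s, which lines up with the ~9 tok/s I saw, and a Q4 quant would be around 15 GB, which is why 24GB of VRAM looks workable once you leave room for context.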