by mft_ 25 days ago
The 27B model is dense, so it is relatively slow. The 35B-A3B model is marginally weaker, but being MoE it is much faster: roughly 4-8x faster in basic benchmarks on my M1 Max.

For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:

Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).

Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
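
For what it's worth, here is a quick sketch of the speedup those two runs imply, using only the four figures quoted above. The names `moe`, `dense` and `speedup` are just illustrative, and the two models were run at different quantizations (Q6_K_XL vs Q4_K_XL), so treat the ratios as rough rather than a like-for-like comparison.

```python
# Throughput in tokens/s, copied from the llama-bench results quoted above.
moe = {"pp512": 858, "tg128": 43}    # Qwen3.6-35B-A3B at Q6_K_XL
dense = {"pp512": 103, "tg128": 8}   # Qwen3.6-27B at Q4_K_XL


def speedup(fast, slow):
    """Return the per-test throughput ratio fast/slow."""
    return {test: fast[test] / slow[test] for test in fast}


ratios = speedup(moe, dense)
for test, ratio in ratios.items():
    print(f"{test}: {ratio:.1f}x")
# pp512: 8.3x
# tg128: 5.4x
```

So on these particular runs the MoE model comes out about 8x faster on prompt processing and a bit over 5x faster on token generation.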

2 comments

Have you tried enabling MTP? Those numbers are similar to what I was getting on my Strix Halo box, but configuring and enabling MTP roughly doubled the TG speed of the 27B model (18-20 t/s now).

Thanks - I’m in the process. I’ve tried it briefly, but so far it appears marginally slower. (Note that llama-bench doesn’t support MTP yet, so you’re reduced to running different prompts and eyeballing the log.)

So I’m assuming I’ve done something wrong along the way, but I’ve not had time yet to explore it.

Thanks for the info.