by hypfer 18 days ago
Qwen3.6-27B-UD-Q4_K_XL can run at 45 t/s with 131k context (q8 KV cache) on an RTX 4090.

That is pretty usable. You could get 65 t/s or more with MTP (multi-token prediction), but only if you reduce the context size, which I would advise against.

Results are better with 256k context and a larger quant; however, that's not going to fit on the 4090 you already had lying around for playing Cyberpunk 2077.
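Rough back-of-envelope for why context is the squeeze: for standard attention layers the KV cache grows linearly with context, so 256k needs about twice the cache of 131k at the same cache type, and that comes on top of the weights and compute buffers. Here's a minimal sketch, assuming llama.cpp-style block sizes for f16/q8_0/q4_0 caches (simplified: it just requires head_dim to be a multiple of 32). The layer/head numbers in the usage lines are made up for illustration and are NOT the real Qwen3.6-27B config; you'd read `num_hidden_layers`, `num_key_value_heads` and `head_dim` from the model's config.json, and if the model mixes in linear-attention layers, only the full-attention layers count.

```python
# Bytes per block of 32 values, assuming llama.cpp-style cache types.
# f16 has no blocks; 32 values * 2 bytes = 64 bytes per 32.
BYTES_PER_32 = {
    "f16": 64,   # 2 bytes per value
    "q8_0": 34,  # fp16 scale + 32 int8 values
    "q4_0": 18,  # fp16 scale + 32 packed 4-bit values
}


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, cache_type: str = "q8_0") -> int:
    """Estimate K+V cache size for full-attention layers only.

    `layers` should count only layers that keep a per-token KV cache.
    """
    if head_dim % 32 != 0:
        raise ValueError("head_dim must be a multiple of 32 for block types")
    # Factor 2: one tensor for keys, one for values, per layer.
    elements = 2 * layers * kv_heads * head_dim * context
    return elements // 32 * BYTES_PER_32[cache_type]


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical architecture values, NOT the real Qwen3.6-27B config.
    cfg = dict(layers=16, kv_heads=4, head_dim=256)
    for ctx in (131_072, 262_144):
        for t in ("f16", "q8_0"):
            size = gib(kv_cache_bytes(context=ctx, cache_type=t, **cfg))
            print(f"{ctx:>7} ctx, {t:>4} cache: {size:.2f} GiB")
```

Whatever the real numbers are, the ratios hold: doubling context doubles the cache, and q8_0 is a bit over half the size of f16.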

The MoE models make me rather unhappy. Idk. They feel braindead to me, but YMMV.