| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by kgeist 5 days ago

I have to disagree with most claims. I run Qwen3.6-27b at 260k context and 40-60 tok/sec. It handles most coding problems as well as Sonnet 4.6 under OpenCode on our production tasks. (As an experiment, I run the same prompts for the same issues in parallel for Qwen 3.6 and Sonnet 4.6 and usually see little difference in performance). I see zero degradation from quantization in practice.

Settings: RTX 5090, 5-bit weights (Unsloth), FP8 KV cache.

Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.

1 comments

zozbot234 5 days ago

A 260k context (close to the stock maximum for Qwen, though it's possible to extend it) will take ~16GB RAM for storing the KV cache, barring quantization tricks which severely degrade quality. That's a whole lot more than what DeepSeek requires for a similar context length, and makes it infeasible to batch multiple inferences together. This used to be the status quo for consumer inference, in fact it still is for models like Kimi and GLM (which can sometimes be smarter than even DeepSeek V4 Pro!) but we can also do better nowadays.