by henrixd 6 days ago

I have been relying heavily on the Qwen3.6-27B-UD-Q4_K_XL.gguf model and the Pi agent (https://pi.dev/) for local tasks and coding. I use the llama-cpp-turboquant fork with some custom MTP patches cherry-picked from another fork.

I'm running this on a V100 32GB (~900 GB/s memory bandwidth) with a 200,000-token context window. The most relevant flags are: --spec-type mpt --spec-draft-n-max 3 --spec-draft-n-min 0 --cache-type-k turbo3 --cache-type-v turbo3

I usually get somewhere around 45-60 t/s. I believe that speed could be improved slightly by switching to the ik_llama.cpp fork and the Qwen3.6-27B-IQ4_NL.gguf model, but that fork has no turboquant support and comes with some other tradeoffs too.
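
For a rough sanity check on those numbers, here is a back-of-envelope sketch. It assumes decoding is memory-bandwidth bound (every generated token reads all the weights once) and that each drafted token is accepted independently with the same probability. The weight size and the acceptance rate are placeholders I made up for illustration, not measurements, and the sketch ignores KV cache reads and the cost of drafting.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on plain decoding speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


def expected_tokens_per_step(accept_rate: float, draft_n: int) -> float:
    """Expected tokens emitted per verification pass with draft_n drafted tokens.

    Assumes every drafted token is accepted independently with probability
    accept_rate, which gives the sum 1 + a + a^2 + ... + a^draft_n.
    """
    return sum(accept_rate ** k for k in range(draft_n + 1))


if __name__ == "__main__":
    bandwidth_gb_s = 900.0  # from the post (~900 GB/s on the V100)
    weights_gb = 17.0       # ASSUMPTION: approximate size of a 4-bit 27B GGUF
    accept_rate = 0.6       # ASSUMPTION: hypothetical draft acceptance rate
    draft_n = 3             # matches --spec-draft-n-max 3

    ceiling = decode_ceiling_tps(bandwidth_gb_s, weights_gb)
    gain = expected_tokens_per_step(accept_rate, draft_n)
    print(f"bandwidth ceiling without speculation: {ceiling:.1f} t/s")
    print(f"expected tokens per verification pass: {gain:.2f}")
```

Real throughput will land below the ceiling, and the speculative gain depends heavily on how often drafts are accepted for a given workload, so treat both numbers as loose bounds rather than predictions.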