| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ihowlatthemoon 101 days ago

I run a setup similar to yours and I've had the best results with Qwen3.5 27B. Specifically the Q4_K_M variant. https://unsloth.ai/docs/models/qwen3.5

I use llama-server that comes with llama.cpp instead of using ollama. Here are the exact settings I use.

llama-server -ngl 99 -c 192072 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-27B-Q4_K_M.gguf

1 comments

greyskull 101 days ago

Thanks, I'll have to continue experimenting. I just ran this model Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL and it works, but if gemini is to be believed this is saturating too much VRAM to use for chat context.

How did you land on that model? Hard to tell if I should be a) going to 3.5, b) going to fewer parameters, c) going to a different quantization/variant.

I didn't consider those other flags either, cool.

Are you having good luck with any particular harnesses or other tooling?

ihowlatthemoon 100 days ago

35B-A3B means it's a MoE model with 35B total parameters but with only 3B active at once. The one I use is the 27B dense model. Usually, dense models give better responses, but are slower than the MoE. With your 4090, you should be able to get about 50 tok/s with the dense model, which is more than enough for practical use.

If you want to keep using the same model, these settings worked for me.

llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

For the harness, I use pi (https://pi.dev/). And sometimes, I use the Roo Code plugin for VS Code. (https://roocode.com/)

I prefer simplicity in my tooling, so I can understand them easier. But you might have better luck with other harnesses.