| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by maldie 14 hours ago

llama.cpp to get 115 tok/s on RTX 4090 with Qwen3.6-27B. For example in Windows the latest CUDA variant llama-b9678-bin-win-cuda-13.3-x64.zip and Unsloth UD-Q4_K_XL MTP gguf:

llama-server.exe --host 0.0.0.0 --alias "Qwen3.6-27B-MTP" -m "F:\Qwen3.6-27B-UD-Q4_K_XL-MTP.gguf" -c 75000 -ngl 99 --metrics --temp 0.6 --top-p 0.95 --min-p 0.00 --top-k 20 --presence-penalty 0.0 --no-mmap -t 16 --spec-type draft-mtp --spec-draft-n-max 3 --reasoning on -fa on --parallel 1 -lv 4

Note that this does not use kv cache quants as in my case quants offload to CPU and tanks performance. Also keep in mind this almost maxes VRAM usage so any additional browsers or other programs that use VRAM should be closed. For chat go to http://localhost:8080/ and minimize the window to maximize perf as the web page UI draw itself consumes a lot of GPU perf via constant context switching.

Can try bigger than -c 75000 until perf gets lower than 100 tok/s - that means something is off as windows starts paging out memory or other issues. -c 50000 seems sweetspot if running browsers and stuff that consume 2GB VRAM. If wanting more than -c 140000 then likely need to use a bit smaller model quant.

CPU usage should be near zero, maybe 1 core load. If you see 8+ core load then settings are off and something is offloaded to CPU (for example kv cache). GPU load should be about 100%, meaning it utilizes work optimally in this case.

-t 16 can be omitted or set to the amount of physical cores, not important in this dense model that is 100% in GPU.

Can be pushed to 125 tok/s with that model if using --spec-draft-n-max 4 but VRAM usage also increases, so context needs to be smaller.

If speed is not important and want max context length then remove the draft-mtp parameters and also might need to use k and v quants like --cache-type-v q8_0, leave k f16 if possible to keep quality.