|
You need to set sampling parameters for the llm. Had the same issue with Qwen3.5 when i first started. You can grab them off the model card page usually. From Qwen3.6 page: Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0 Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0 Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0 |
set min_p to like 0.3 and ignore top_p and top_k and you'll be fine.
There's better samplers now like top N sigma, top-h, P-less decoding, etc, but they're often not available in your LLM inference engine (i.e. vLLM)