| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by proxysna 93 days ago

You need to set sampling parameters for the llm. Had the same issue with Qwen3.5 when i first started. You can grab them off the model card page usually.

From Qwen3.6 page:

Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

2 comments

Der_Einzige 93 days ago

min_p author here. min_p is strictly better than top_p and top_k. The big labs don't know shit about sampling, and give absolutely nuts recommendations like this.

set min_p to like 0.3 and ignore top_p and top_k and you'll be fine.

There's better samplers now like top N sigma, top-h, P-less decoding, etc, but they're often not available in your LLM inference engine (i.e. vLLM)

link

JSR_FDED 92 days ago

I’m wondering though, what does extra creativity in code generation actually look like? How is the creativity expressed in code? Does the LLM reach for Bubble Sort instead of Quicksort? Maybe it decides that sorting only the first 10 elements of an array is enough? Funny variable names? Cursing in comments?

link

Der_Einzige 92 days ago

In this case, we are not arguing that min_p is better for "creative code" (you really don't want high temperature anywhere near your code generation, despite the "turning up the heat" framing of our paper) - at least in my post claiming min_p is strictly better than top_p above.

We are instead arguing that min_p handles truncating tokens that are more likely to lead to degeneration/looping because it is partially distribution aware. Fully distribution aware samplers like the ones I mentioned above (i.e. P-less decoding) are strictly superior due to using the whole distribution to decide the truncation at every time step.

Code hallucinations, like many LLM hallucinations, can be seen as accumulation of small amounts of "sampling errors".

link

proxysna 93 days ago

Cool, i am mostly a plumber for these things, but do you have any sort of reading that i can go through to wrap my head around it to some degree?

link

deanc 93 days ago

Yes, have tried all of these (as per the docs). Have you actually tried these? Because I have tried all 3 configurations with agentic coding that you mentioned and have the same result - loops.

link

proxysna 93 days ago

I've used only Qwen3.5 so far for work and was, after initial struggles, successful with GPU setup, no mlx. Ngl the fact that they are using `presence_penalty: 0` and no `max_tokens` is weird after that exact setup caused me "initial struggles", but i've set up a simple docker-compose with vllm and qwen3.6 right now to test it out and it worked perfectly fine for me.

Gist with the compose and example of an output. https://gist.github.com/meaty-popsicle/f883f4a118ff345b430c3...

link