by antirez 476 days ago
Note the massive context length (130k tokens). Also because it would be kinda pointless to generate a long CoT without enough context to contain it and the reply.

EDIT: Here we are. My first prompt created a CoT so long that it catastrophically forgot the task (but I don't believe I was near 130k -- using ollama with fp16 model). I asked one of my test questions, a coding question totally unrelated to what it says:

<QwQ output> But the problem is in this question. Wait perhaps I'm getting ahead of myself.

Wait the user hasn't actually provided a specific task yet. Let me check again.

The initial instruction says:

"Please act as an AI agent that can perform tasks... When responding, first output a YAML data structure with your proposed action, then wait for feedback before proceeding."

But perhaps this is part of a system prompt? Wait the user input here seems to be just "You will be given a problem. Please reason step by step..." followed by a possible task? </QwQ>

Note: Ollama "/show info" shows that the context size set is correct.

10 comments

> Note: Ollama "/show info" shows that the context size set is correct.

That's not what Ollama's `/show info` is telling you. It actually just means that the model is capable of processing the context size displayed.

Ollama's behavior around context length is very misleading. There is a default context length limit parameter unrelated to the model's capacity, and I believe that default is a mere 2,048 tokens. Worse, when the prompt exceeds it, there is no error -- Ollama just silently truncates it!

If you want to use the model's full context window, you'll have to execute `/set parameter num_ctx 131072` in Ollama chat mode, or if using the API or an app that uses the API, set the `num_ctx` parameter in your API request.
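For the API route, a minimal sketch of such a request in Python might look like the following. It assumes a local Ollama server on the default port 11434 and its `/api/generate` endpoint with an `options` object; the model name `qwq` and the helper names `build_generate_payload` and `generate` are placeholders for illustration, not anything from the thread.

```python
import json
import urllib.request

# Assumption: a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a request body that overrides the per-request context length."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Without this option the server falls back to its own default limit.
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 131072) -> str:
    """Send the request and return the generated text (needs a running server)."""
    body = json.dumps(build_generate_payload(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("qwq", "your prompt here")` would then request the full 131072-token window, provided the machine has enough memory for it.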

OK, this explains why QwQ is working great on their chat. Btw, I have seen this multiple times: ollama inference, for one reason or another, even without quantization, had issues with the actual model performance. In one instance the same model at the same quantization level was great when run with MLX, while I got terrible results with ollama. The point here is not ollama itself, but that there is no testing at all for these models.

I believe that models should be released with test vectors at t=0, providing what is the expected output for a given prompt for the full precision and at different quantization levels. And also for specific prompts, the full output logits for a few tokens, so that it's possible to also compute the error due to quantization or inference errors.
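As a rough illustration of the test-vector idea, here is a sketch of how an inference engine's logits for one token position could be checked against published reference logits. No such reference files exist as far as this thread says; the function names and the choice of metrics (maximum absolute error and KL divergence) are assumptions made for the example.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compare_logits(reference, observed):
    """Return (max absolute logit error, KL(reference || observed))."""
    if len(reference) != len(observed):
        raise ValueError("logit vectors must have the same vocabulary size")
    max_abs_err = max(abs(r - o) for r, o in zip(reference, observed))
    p = softmax(reference)
    q = softmax(observed)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return max_abs_err, kl


def same_greedy_token(reference, observed):
    """At t=0 decoding picks the argmax, so both runs should agree on it."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)
    return argmax(reference) == argmax(observed)
```

A quantized build would be expected to show a small but nonzero error, while a wrong chat template or a truncated prompt would likely show up as a large divergence or a different greedy token.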

Yeah the state of the art is pretty awful. There have been multiple incidents where a model has been dropped on ollama with the wrong chat template, resulting in it seeming to work but with greatly degraded performance. And I think it's always been a user that notices, not the ollama team or the model team.
I'm grateful for anyone's contributions to anything, but I kinda shake my head about ollama. The reason stuff like this happens is that they're doing the absolute minimum necessary to get the latest model running, not working.

I make a llama.cpp wrapper myself, and it's somewhat frustrating putting effort in for everything from big obvious UX things, like error'ing when the context is too small for your input instead of just making you think the model is crap, to long-haul engineering commitments, like integrating new models with llama.cpp's new tool calling infra, and testing them to make sure it, well, actually works.

I keep telling myself that this sort of effort pays off a year or two down the road, once all that differentiation in effort day-to-day adds up. I hope :/
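The "error when the context is too small" behavior described above can be sketched in a few lines. This is only an illustration, not code from the commenter's wrapper; `ContextOverflowError`, `check_context`, and the `reserve_for_reply` parameter are hypothetical names, and the token count is assumed to come from the model's own tokenizer.

```python
class ContextOverflowError(ValueError):
    """Raised when the input cannot fit in the configured context window."""


def check_context(prompt_tokens: int, num_ctx: int, reserve_for_reply: int = 0) -> int:
    """Return the remaining token budget, or raise instead of truncating."""
    needed = prompt_tokens + reserve_for_reply
    if needed > num_ctx:
        raise ContextOverflowError(
            f"need {needed} tokens but the context window is {num_ctx}"
        )
    return num_ctx - needed
```

Failing loudly here means the user sees a configuration problem rather than concluding that the model itself is bad.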

Can you link your wrapper? I've read and run up against a lot of footguns related to Ollama myself and I think surfacing community efforts to do better would be quite useful.
Cheers, thanks for your interest:

Telosnex, @ telosnex.com --- fwiw, general positioning is around paid AIs, but there's a labor-of-love llama.cpp backed on device LLM integration that makes them true peers, both in UI and functionality. albeit with a warning sign because normie testers all too often wander into trying it on their phone and killing their battery.

My curse is the standard engineer one - only place I really mention it is one-off in comments like here to provide some authority on a point I want to make...I'm always one release away from it being perfect enough to talk up regularly.

I really really need to snap myself awake and ban myself from the IDE for a month.

But this next release is a BFD, full agentic coding, with tons of tools baked in, and I'm so damn proud to see the extra month I've spent getting llama.cpp tools working agentically too. (https://x.com/jpohhhh/status/1897717300330926109, real thanks is due to @ochafik at Google, he spent a very long time making a lot of haphazard stuff in llama.cpp coalesce. also phi-4 mini. this is the first local LLM that is reasonably fast and an actual drop-in replacement for RAG and tools, after my llama.cpp patch)

Please, feel free to reach out if you try it and have any thoughts, positive or negative. james @ the app name.com

The test vectors idea is pretty interesting! That's a good one.

I haven't been able to try out QwQ locally yet. There seems to be something wrong with this model on Ollama / my MacBook Pro. The text generation speed is glacial (much, much slower than, say Qwen 72B at the same quant). I also don't see any MLX versions on LM Studio yet.

Ollama defaults to a context of 2048 regardless of model unless you override it with /set parameter num_ctx [your context length]. This is because long contexts make inference slower. In my experiments, QwQ tends to overthink and question itself a lot and generate massive chains of thought for even simple questions, so I'd recommend setting num_ctx to at least 32768.

In my experiments on a couple of mechanical engineering problems, it did fairly well on final answers, correctly solving problems that even DeepSeek r1 (full size) and GPT 4o got wrong in my tests. However, the chain of thought was absurdly long, convoluted, circular, and all over the place. This also made it very slow, maybe 30x slower than comparably sized non-thinking models.

I used a num_ctx of 32768, top_k of 30, temperature of 0.6, and top_p of 0.95. These parameters (other than context length) were recommended by the developers on Hugging Face.

I always see:

  /set parameter num_ctx <value>
Explained but never the follow up:

  /save <custom-name>
So you don't have to do the parameter change every load. Is there a better way or is it kind of like setting num_ctx in that "you're just supposed to know"?
You can also set

    OLLAMA_CONTEXT_LENGTH=<tokens>
as an environment variable to change ollama's default context length.
I think that will not work if you use the OpenAI compatible API endpoint.
I tried this with ollama run, and it had no effect at all.
that env parameter is brand new, did you update ollama?
My understanding is that top_k and top_p are two different methods of decoding tokens during inference. top_k=30 considers the top 30 tokens when selecting the next token to generate and top_p=0.95 considers the top 95 percentile. You should need to select only one.

https://github.com/ollama/ollama/blob/main/docs/modelfile.md...

Edit: Looks like both work together. "Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)"

Not quite sure how this is implemented - maybe one is preferred over the other when there are enough interesting tokens!

They both work on a sorted list of tokens by probability. top_k selects a fixed amount of tokens, top_p selects the top tokens until the sum of probabilities passes the threshold p. So for example if the top 2 tokens have a .5 and .4 probability, then a 0.9 top_p would stop selecting there.

Both can be chained together and some inference engines let you change the order of the token filtering, so you can do p before k, etc. (among all other sampling parameters, like repetition penalty, removing top token, DRY, etc.) each filtering step readjusts the probabilities so they always sum to 1.
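To make the description above concrete, here is a minimal sketch of the two filters in plain Python, operating on a token-to-probability dict. It is not how any particular inference engine implements them; the function names and the small `eps` tolerance are assumptions for the example.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}


def top_p_filter(probs, p, eps=1e-12):
    """Keep top tokens until their cumulative probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p - eps:  # eps guards against float rounding at the boundary
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def chain_filters(probs, k, p):
    """Apply top-k first, then top-p (some engines allow the reverse order)."""
    return top_p_filter(top_k_filter(probs, k), p)
```

With the example from the comment, `top_p_filter({"a": 0.5, "b": 0.4, "c": 0.06, "d": 0.04}, 0.9)` keeps only `a` and `b` and rescales them so they sum to 1.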

"My first prompt created a CoT so long that it catastrophically forgot the task"

Many humans would do that

I tried the 'Strawberry' question which generated nearly 70k words of CoT.
I think you guys might be using too low of a temperature, it never goes beyond like 1k thinking tokens for me.
lol did it at least get it right?
It's a hard problem, that's a lot to ask.
Yeah, it did the same in my case too. It did all the work in the <think> tokens but did not spit out the actual answer. I was not even close to 100K tokens.
If you did not change the context length, it is almost certainly 2k or so. In "/show info" there is a field "context length" which describes the model in general, while "num_ctx" under "parameters" is the context length for the specific chat.

I use modelfiles, because the only reason I use ollama is its easy integration with other tools, e.g. zed; this way I can directly choose models with a preset context size.

Here nothing fancy, just

    FROM qwq
    PARAMETER num_ctx 100000
You save this somewhere as a text file, you run

    ollama create qwq-100k -f path/to/that/modelfile
and you now have "qwq-100k" in your list of models.
From https://huggingface.co/Qwen/QwQ-32B

> Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.

Sorry, could you please explain what this means? I'm not into machine learning, so I don't get the jargon.
Well I can't be positive, but it looks like some of the factors that support a long context length might be set wrong. https://blog.eleuther.ai/yarn/
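A small sketch may help with the "static" part of that quote. The config values below are assumptions (a `rope_scaling` entry with factor 4.0 over an original 32768-token window, which would give the 131072 figure mentioned elsewhere in this thread), and `dynamic_factor` follows the max(1, length / original length) rule described in the YaRN write-up linked above; neither function is vLLM code.

```python
# Assumed rope_scaling values, for illustration only.
ROPE_SCALING = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}


def extended_context(cfg):
    """Context length reachable with the configured scaling factor."""
    return int(cfg["factor"] * cfg["original_max_position_embeddings"])


def static_factor(cfg, input_len):
    """Static scaling: the same factor is applied whatever the input length."""
    return cfg["factor"]


def dynamic_factor(cfg, input_len):
    """Dynamic scaling: no stretching until the input exceeds the original window."""
    return max(1.0, input_len / cfg["original_max_position_embeddings"])
```

For a 1,000-token prompt the static variant still stretches positions by 4x while the dynamic one leaves them untouched, which is the "potentially impacting performance on shorter texts" caveat in the quote.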
Can’t wait to see if my memory can even accommodate this context
Oddly, the Chinese LLM host SiliconFlow only makes it available with 32k context, which is even smaller than their DeepSeek-R1 offering.
that's interesting... i've been noticing similar issues with long context windows & forgetting. are you seeing that the model drifts more towards the beginning of the context or is it seemingly random?

i've also been experimenting with different chunking strategies to see if that helps maintain coherence over larger contexts. it's a tricky problem.

Neither lost-in-the-middle nor long-context performance has seen much improvement in the past year. It's not easy to generate long training examples that also stay meaningful, and all existing models still become significantly dumber after 20-30k tokens, particularly on hard tasks.

Reasoning models probably need some optimization constraint put on the length of the CoT, and also some priority constraint (only reason about things that need it).