R1 is trained for a context length of 128K. Where are you getting 8K/32K? The model doesn't distinguish "thinking" tokens and "output" tokens, so this must be some specific API limitations.
> max_tokens:The maximum length of the final response after the CoT output is completed, defaulting to 4K, with a maximum of 8K. Note that the CoT output can reach up to 32K tokens, and the parameter to control the CoT length (reasoning_effort) will be available soon. [1]
I’m using it through Kagi which doesn’t use Deepseek’s official API [1]. That limitation from the docs seems to be everywhere.
In practice I don’t think anyone can economically host the whole model plus the kv cache for the entire context size of 128k (and I’m skeptical of Deepseek’s claims now anyway).
Edit: a Kagi team member just said on Discord that they’ll be increasing max tokens next release
He's just repeating a lot of disinformation that has been released about deepseek in the last few days. People who took the time to test DeepSeek models know that the results have the same or better quality for coding tasks.
Benchmarks are great to have but individual/org experiences on specific codebases still matter tremendously.
If an org consistently finds one model performs worse on their corpus than another, they aren't going to keep using it because it ranks higher in some set of benchmarks.
But you should also be very wary of these kind of anecdotes, and this thread highlights exactly why. That commenter says in another comment (https://news.ycombinator.com/item?id=42866350) that the token limitation that he is complaining about has actually nothing to do with DeepSeek's model or their API, but is a consequence of an artificial limit that Kagi imposes. In other words, his conclusion about DeepSeek is completely unwarranted.
It mashed the header and C++ file together, which is egregiously bad in the context of QT. This isn’t a new library, it’s been around for almost thirty years. Max token sizes have nothing to do with that.
I invite anyone to post a chat transcript showing a successful run of R1 against this prompt (and please tell me which API/service it came from so I can go use it too!)
I wasn't suggesting using the anecdotes of others to make a decision.
I'm talking about individuals and organizations making a decision on whether or not to use a model based on their own testing. That's what ultimately matters here.
[1] https://api-docs.deepseek.com/guides/reasoning_model