| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by miki123211 507 days ago

It is possible to learn to reason from scratch, that's what R1-0 did, but the resulting chains of thought aren't legible to humans.

To quote DeepSeek directly:

> DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.

1 comments

dchichkov 507 days ago

If you look at the benchmarks of the DeepSeek-V3-Base, it is quite capable, even in 0-shot: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base#base-mod... This is not from scratch. These benchmark numbers are an indication that the base model already had a large number of reasoning/LLM tokens in the pre-training set.

On the other hand, my take on it, the ability to do reasoning in a long context is a general capability. And my guess is that it can be bootstrapped from scratch, without having to do training on all of the internet or having to distill models trained on the internet.

cma 507 days ago

> These benchmark numbers are an indication that the base model already had a large number of reasoning/LLM tokens in the pre-training set.

But we already know that is the case: the Deepseek v3 paper says it was posttrained partly with an internal version of R1:

> Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.

And deepseekmath did a repeated cycle of this kind of thing mixing in 10% of old previously seen data with new generated data from last gen in a continuous bootstrap.