

by mertnesvat 517 days ago

The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI.

Their model crushes it on closed-system tasks (97.3% on MATH-500, 2029 Codeforces rating) where success criteria are clear. This makes sense - RL thrives when you can define concrete rewards. Clean feedback loops in domains like math and coding make it easier for the model to learn what "good" looks like.

What's counterintuitive is they achieved this without the usual supervised learning step. This hints at a potential shift in how we might train future models for well-defined domains. The MIT license is nice, but the real value is showing you can bootstrap complex reasoning through pure reinforcement.

The challenge will be extending this to open systems (creative writing, cultural analysis, etc.) where "correct" is fuzzy. You can't just throw RL at problems where the reward function itself is subjective.

This feels like a "CPU moment" for AI - just as CPUs got really good at fixed calculations before GPUs tackled parallel processing, we might see AI master closed systems through pure RL before cracking the harder open-ended domains.

The business implications are pretty clear - if you're working in domains with clear success metrics, pure RL approaches might start eating your lunch sooner than you think. If you're in fuzzy human domains, you've probably got more runway.

6 comments

aimanbenbaha 517 days ago

Interestingly, Karpathy made this point last summer, saying that RLHF is barely RL and that it would be very difficult to apply pure reinforcement learning to open domains. That is why RLHF is a shortcut to fill this gap. But because the reward model is trained on human vibe checks, the LLM could easily game the RM by giving misleading responses.

Importantly, the barrier is that open domains are too complex and too undefined to have a clear reward function. But if someone cracks that, meaning they create a way for AI to self-optimize in these messy, subjective spaces, it'll completely revolutionize LLMs through pure RL.

Here's the link to the tweet: https://x.com/karpathy/status/1821277264996352246


leobg 513 days ago

The whole point of RLHF is to make up for the fact that there is no loss function for a good answer in terms of token ids or their order. A good answer can come in many different forms and shapes.

That’s why all those models fine-tuned on (instruction, input, answer) tuples are essentially lobotomized. They’ve been told that, for the given input, only the output given in the training data is correct, and any deviation should be “punished”.

In truth, for each given input, there are many examples of output that should be reinforced, many examples of output that should be punished, and a lot in between.

When BF Skinner used to train his pigeons, he’d initially reinforce any tiny movement that at least went in the right direction. For example, instead of waiting for the pigeon to peck the lever directly (which it might not do for many hours), he’d give reinforcement if the pigeon so much as turned its head towards the lever. Over time, he’d raise the bar. Until, eventually, only clear lever pecks would receive reinforcement.

We should be doing the same when taming LLMs from their pretraining as document completers into assistants.
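
To make the shaping idea concrete, here is a minimal sketch of a reward whose bar rises over training. Everything in it is an assumption for illustration: the `quality` score in [0, 1] is a hypothetical stand-in for however a response gets graded, and the linear schedule and the 0.1 and 0.9 endpoints are arbitrary choices, not anything Skinner or any lab actually uses.

```python
def shaping_threshold(step: int, total_steps: int,
                      start: float = 0.1, end: float = 0.9) -> float:
    """Linearly raise the bar from `start` to `end` (assumes total_steps > 0)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + frac * (end - start)


def shaped_reward(quality: float, step: int, total_steps: int) -> float:
    """Reinforce (1.0) any response whose quality clears the current bar.

    `quality` is a hypothetical score in [0, 1]; where it comes from is
    left open in this sketch.
    """
    return 1.0 if quality >= shaping_threshold(step, total_steps) else 0.0


if __name__ == "__main__":
    # The same mediocre response is rewarded early and ignored late.
    print(shaped_reward(0.3, step=0, total_steps=100))
    print(shaped_reward(0.3, step=100, total_steps=100))
```

Early on, a response that is only roughly headed in the right direction still gets reinforced; by the end, only clearly good responses do.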


hb-robo 517 days ago

Layman question here since this isn't my field: how do you achieve success on closed-system tasks without supervision? Surely at some point along the way, the system must understand whether its answers and reasoning are correct.


boole1854 517 days ago

In their paper, they explain that "in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases."

Basically, they have an external source-of-truth that verifies whether the model's answers are correct or not.
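
As a rough illustration of what such rule-based checks could look like, here is a small sketch. It is not the paper's code: the function names, the exact-string comparison, and the simplified `\boxed{...}` pattern (no nested braces) are all assumptions, and the code check takes an already-loaded Python callable rather than compiling and sandboxing a model's submission.

```python
import re
from typing import Callable, Optional, Sequence, Tuple


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Simplification: does not handle nested braces such as \\boxed{\\frac{1}{2}}.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def math_reward(model_output: str, reference: str) -> float:
    """1.0 if the boxed final answer equals the reference string, else 0.0."""
    answer = extract_boxed(model_output)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def code_reward(candidate: Callable[[int], int],
                tests: Sequence[Tuple[int, int]]) -> float:
    """Fraction of predefined (input, expected) test cases the candidate passes."""
    passed = 0
    for arg, expected in tests:
        try:
            if candidate(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests) if tests else 0.0


if __name__ == "__main__":
    print(math_reward(r"Adding up gives \boxed{42}", "42"))   # correct format and answer
    print(math_reward("The answer is 42", "42"))              # right number, but not boxed
    print(code_reward(lambda x: x * 2, [(1, 2), (2, 4)]))     # passes both tests
```

Note that the reward only looks at the final answer or the test results; nothing grades the reasoning in between, which is what makes the check cheap and hard to argue with.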


davmre 517 days ago

You're totally right that there must be supervision; it's just a matter of how the term is used.

"Supervised learning" for LLMs generally means the system sees a full response (e.g. from a human expert) as supervision.

Reinforcement learning is a much weaker signal: the system has the freedom to construct its own response / reasoning, and only gets feedback at the end whether it was correct. This is a much harder task, especially if you start with a weak model. RL training can potentially struggle in the dark for an exponentially long period before stumbling on any reward at all, which is why you'd often start with a supervised learning phase to at least get the model in the right neighborhood.
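
A toy calculation shows where "exponentially long" comes from. It assumes, purely for illustration, that a solution needs `n_steps` reasoning steps, that the model gets each step right independently with probability `p_step`, and that reward arrives only if every step is right. Real models don't behave this neatly, so treat the numbers as intuition rather than measurement.

```python
def chance_of_reward(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps independent steps are correct."""
    return p_step ** n_steps


def expected_rollouts_until_reward(p_step: float, n_steps: int) -> float:
    """Mean number of rollouts needed to see the first reward (geometric distribution)."""
    return 1.0 / chance_of_reward(p_step, n_steps)


if __name__ == "__main__":
    # A weak model (50% per step) versus a stronger one (90% per step), 10 steps each.
    print(expected_rollouts_until_reward(0.5, 10))
    print(expected_rollouts_until_reward(0.9, 10))
```

Under these assumptions, the weak model needs about a thousand rollouts on average before it sees a single reward on a 10-step problem, while the stronger one needs about three. That gap is the case for getting the model into the right neighborhood first.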


aomix 517 days ago

They use other models to judge correctness and, when possible, just ask the model to output something that can be directly verified, like math equations that can be checked 1:1 against the correct answer.


jjtheblunt 517 days ago

> the real value is showing you can bootstrap complex reasoning through pure reinforcement.

This made me smile, as I thought (non-snarkily) that's what living beings do.


fsndz 517 days ago

This! And the truth is: are there really that many corporate domains without "clear success metrics"?


petra 517 days ago

You also need to be able to test how successful your solution is.

In some domains that is harder than in math and code.


fsndz 517 days ago

True. I think simulations will help a lot in that direction. Imagine if you could do RL a bit like DeepSeek did for R1, but on corporate tasks. https://open.substack.com/pub/transitions/p/deepseek-is-comi...


fsndz 517 days ago

emphasis on corporate


data_maan 517 days ago

The MIT licence is for code only
