The most interesting part of DeepSeek's R1 release isn't just the performance - it's their pure RL approach without supervised fine-tuning. This is particularly fascinating when you consider the closed vs open system dynamics in AI.
Their model crushes it on closed-system tasks (97.3% on MATH-500, 2029 Codeforces rating) where success criteria are clear. This makes sense - RL thrives when you can define concrete rewards. Clean feedback loops in domains like math and coding make it easier for the model to learn what "good" looks like.
What's counterintuitive is they achieved this without the usual supervised learning step. This hints at a potential shift in how we might train future models for well-defined domains. The MIT license is nice, but the real value is showing you can bootstrap complex reasoning through pure reinforcement.
The challenge will be extending this to open systems (creative writing, cultural analysis, etc.) where "correct" is fuzzy. You can't just throw RL at problems where the reward function itself is subjective.
This feels like a "CPU moment" for AI - just as CPUs got really good at fixed calculations before GPUs tackled parallel processing, we might see AI master closed systems through pure RL before cracking the harder open-ended domains.
The business implications are pretty clear - if you're working in domains with clear success metrics, pure RL approaches might start eating your lunch sooner than you think. If you're in fuzzy human domains, you've probably got more runway.
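To make the "concrete rewards" point a bit more tangible, here's a rough sketch of what a rule-based reward for a math task could look like, next to the open-ended case where there's nothing obvious to write. Everything here is hypothetical: the function names are made up, and the convention of putting the final answer in `\boxed{...}` is an assumption for illustration, not DeepSeek's actual reward code.

```python
import re
from typing import Optional

# Assumed convention: the model writes its final answer as \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the contents of the last boxed answer, or None if there is none."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """Closed-system reward: 1.0 if the final answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


def essay_reward(completion: str) -> float:
    """Open-system reward: there is no agreed-upon rule to put here."""
    raise NotImplementedError("no objective check for subjective tasks")
```

The math reward is a cheap string comparison that anyone can verify, which is what makes the feedback loop clean. The essay version has no equivalent check, so in this sketch it is deliberately left unimplemented.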
Importantly, the barrier is that open domains are too complex and too loosely defined to have a clear reward function. But if someone cracks that (meaning they find a way for AI to self-optimize in these messy, subjective spaces), it'll completely revolutionize LLMs through pure RL.
Here's the link to the tweet: https://x.com/karpathy/status/1821277264996352246