| HN Mirror



	by HarHarVeryFunny 494 days ago
	DeepSeek's approach with R1 wasn't pure RL. They used RL alone only to develop R1-Zero from their V3 base model, but then went through two iterations of using the current model to generate synthetic reasoning data, doing SFT on that data, then RL fine-tuning, and repeating.
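
	To make the ordering of the stages concrete, here is a rough sketch of the loop described above. Everything in it is a hypothetical placeholder: the function names (`rl_finetune`, `generate_reasoning_data`, `sft`, `train_pipeline`) are invented for illustration, models are just strings, and no real training happens. The comment also doesn't say which checkpoint the SFT step starts from, so fine-tuning the current model is an assumption here.

```python
from typing import List


def rl_finetune(model: str) -> str:
    # Placeholder: stands in for an RL fine-tuning run.
    return f"RL({model})"


def generate_reasoning_data(model: str) -> str:
    # Placeholder: stands in for sampling synthetic reasoning traces.
    return f"data[{model}]"


def sft(model: str, data: str) -> str:
    # Placeholder: stands in for supervised fine-tuning on `data`.
    # Assumption: SFT is applied to the current model.
    return f"SFT({model}, {data})"


def train_pipeline(base: str, rounds: int = 2) -> List[str]:
    """Return the checkpoint produced at the end of each RL stage."""
    model = rl_finetune(base)          # RL-only stage from the base model
    checkpoints = [model]
    for _ in range(rounds):
        data = generate_reasoning_data(model)  # current model writes data
        model = sft(model, data)               # SFT on that data
        model = rl_finetune(model)             # then RL again
        checkpoints.append(model)
    return checkpoints


if __name__ == "__main__":
    for i, ckpt in enumerate(train_pipeline("V3-base")):
        print(i, ckpt)
```

	The first checkpoint is the RL-only model; each later one comes out of a full generate, SFT, RL round.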