| HN Mirror

	by hodapp 493 days ago
	You are right; the advances in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences they were generating and training it on.