| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by piecerough 505 days ago
	SFT forces the model to output _that_ reasoning trace you have in data. RL allows whatever reasoning trace and only penalizes it if it does not reach the same answer