| HN Mirror


	by zby 550 days ago
	I think this is where the relative rewards come into play: they sample many thinking traces for the same problem and reward the ones that reach a correct answer. This works at the model's current 'cutting edge', on problems it solves only some of the time, which is exactly where it can be improved.
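
To make the "relative" part concrete, here is a minimal sketch, assuming a GRPO-style setup where each sampled trace gets reward 1.0 if its final answer is correct and 0.0 otherwise, and each reward is then normalized against the other traces sampled for the same problem. The function name, the 0/1 reward, and the normalization are illustrative assumptions, not details from the comment above.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled trace relative to its own group (hypothetical helper).

    rewards: one reward per thinking trace sampled for the same problem,
             assumed here to be 1.0 for a correct answer and 0.0 otherwise.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev over the group
    # eps avoids division by zero when every trace got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Problem the model always solves: no trace stands out, so no signal.
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Problem the model never solves: again nothing to prefer.
too_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Problem at the edge of ability: correct traces get a positive
# advantage, incorrect ones a negative advantage.
at_edge = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

print(too_easy, too_hard, at_edge)
```

Under these assumptions, groups where every trace succeeds or every trace fails contribute zero advantage, so the learning signal comes only from problems the model gets right some of the time.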