| HN Mirror


	by iceman_w 465 days ago
	RL constrains the space of possible output token sequences to those likely to lead to the correct answer. So we are inherently making a trade-off: lower variance in exchange for covering less of the output space. A non-RL model has higher variance, so given enough attempts, it will come up with some correct answers that an RL model can't.
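
	A toy pass@k calculation makes the trade-off concrete. This is only a sketch: the per-problem success probabilities are made-up numbers, not measurements, and `pass_at_k`, `base_model`, and `rl_model` are hypothetical names. It assumes each attempt is an independent sample, so the chance of solving a problem at least once in k tries is 1 - (1 - p)^k.

```python
def pass_at_k(success_probs, k):
    """Expected fraction of problems solved at least once in k independent attempts."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-attempt success probabilities on four problems (illustrative only).
# The base model is unreliable everywhere but has nonzero probability on every problem.
base_model = [0.30, 0.30, 0.05, 0.05]
# The RL model is very reliable on two problems and assigns zero probability to the others.
rl_model = [0.90, 0.90, 0.00, 0.00]

for k in (1, 10, 100):
    base = pass_at_k(base_model, k)
    rl = pass_at_k(rl_model, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

	With these made-up numbers the RL model wins at k=1, but its pass@k can never exceed 0.5 because two problems have zero probability, while the base model keeps climbing as k grows.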