| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cshimmin 1146 days ago
	Yeah but even in that case, "reward" is just the thing a NN is trying to predict. The NN itself is not receiving the reward (or any punishment). Instead, it's following gradient signals to improve that estimate of reward, which is then used as a proxy for an optimal policy decision.