|
|
|
|
|
by cshimmin
1146 days ago
|
|
Yeah but even in that case, "reward" is just the thing a NN is trying to predict. The NN itself is not receiving the reward (or any punishment). Instead, it's following gradient signals to improve that estimate of reward, which is then used as a proxy for an optimal policy decision. |
|