| HN Mirror



	by zby 550 days ago
	I think the reward is relative to the other answers sampled for the same question. That way the signal is strongest at the very margin of what a given model can do, and questions that are impossible or too easy contribute less noise. The confusion arises because they do compute that simple reward, but then they convert it to a relative value and call it the advantage. I think it is that advantage, not the base reward, that they use to update the model.


krackers 550 days ago

Yes, you're right. In their paper I think they say that sampling multiple traces and then taking relative rewards is supposed to be a Monte Carlo approximation of what the value network would provide? I don't really have the intuition for that, but it does make sense that rather than simply nudging probabilities toward the trace with the highest absolute reward, you want to favor the trace that had the best reward relative to the current state. For quick intuition: if the absolute rewards for four traces were {0, 0, 0, 0.01}, then using absolute rewards would give only a weak signal for the last trace (nudge weights proportional to 0.01 * logprob), whereas using relative rewards (based on the z-score) would give a signal of about 1.5 * logprob.
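To make the arithmetic in that example concrete, here is a minimal sketch of the z-score normalization. The function name `group_relative_advantages` and the `eps` term are illustrative assumptions, not taken from the paper. The sketch also assumes the sample standard deviation (n - 1 denominator), which is the choice that reproduces the 1.5 figure.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against the other samples for the same question.

    Hypothetical helper: uses the sample standard deviation (n - 1) and a
    small eps so that a group of identical rewards does not divide by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample std, n - 1 denominator
    return [(r - mean) / (std + eps) for r in rewards]


# The four traces from the comment above.
rewards = [0.0, 0.0, 0.0, 0.01]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])  # [-0.5, -0.5, -0.5, 1.5]

# A question that is too easy (or impossible) gives identical rewards,
# so every advantage is zero and the group contributes no update signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```

With the population standard deviation (n denominator) the last trace would instead come out at about 1.73. Either way the relative signal is on the order of 1 rather than 0.01.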

zby 550 days ago

Not only that: if you have {0, 0, 0, 0.01}, then the probability of getting any reward in a single shot would be very low. I also have the intuition that rewarding traces at the edge is more efficient, because the model needs only a small perturbation to get them right. If you gave negative rewards to traces that are very far from being right, the model might be steered in the wrong direction.