by zby
503 days ago
I think the reward is relative to the other answers sampled for the same question. That way the signal is strongest right at the margin of what a given model can do, and questions that are impossible or too easy contribute less noise. There is some confusion because they do compute that simple reward, but then they convert it to a relative value and call it the advantage. And I think it is that advantage, not the base reward, that they use to update the model.
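To make the idea concrete, here is a rough sketch of what "convert the reward to a relative value" could look like. The function name `group_relative_advantages`, the `eps` parameter, and the mean/std normalization are my assumptions for illustration; the exact formula they use may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Turn raw rewards for several sampled answers to ONE question
    into relative advantages.

    Assumption: normalize by the group's mean and (population) standard
    deviation. This is an illustrative choice, not a confirmed formula.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every answer gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# A question at the margin: some sampled answers are right, some are wrong.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Too easy (all correct) or impossible (all wrong): every advantage is zero,
# so these questions contribute no update signal.
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
impossible = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed, too_easy, impossible)
```

With this kind of normalization, a question where every sample scores the same gives zero advantage everywhere, which is the "less noise from impossible or too easy questions" effect.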