
The model is rewarded for accuracy. Each puzzle comes with a few multiple-choice questions, and the reward is the fraction it answers correctly. If it got 1 out of 4 correct, for example, its reward would be 0.25.
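As a rough sketch of that reward, assuming answers are compared by exact match (the function name `accuracy_reward` and the letter-style answers are made up for illustration, not taken from any particular training setup):

```python
def accuracy_reward(predicted, correct):
    """Fraction of multiple-choice questions answered correctly.

    `predicted` and `correct` are equal-length sequences of answer labels.
    Exact-match comparison is an assumption made for this sketch.
    """
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if not correct:
        return 0.0
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)


# One of four answers matches, so the reward is 0.25.
reward = accuracy_reward(["A", "B", "C", "D"], ["A", "C", "D", "B"])
print(reward)  # 0.25
```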

Then group-relative advantages are calculated. If you have 16 different responses to the same puzzle and their average accuracy is 0.5, you subtract that average from each response's reward and divide by the standard deviation of the group's rewards. Say the standard deviation is 0.25. Then the advantage for our example would be (0.25 - 0.5) / 0.25 = -1.
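Here is a minimal sketch of that normalization. Two things are assumptions on my part: it uses the population standard deviation (some implementations use the sample version or add a small epsilon to the denominator), and it returns all zeros when every reward in the group is identical. The group of 16 rewards is invented so that the mean is 0.5 and the standard deviation is 0.25, matching the numbers above.

```python
import math


def group_relative_advantages(rewards):
    """Normalize each reward against its own group: (r - mean) / std."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance (divide by n); this choice is an assumption.
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    if std == 0:
        # Every response scored the same, so none is better than average.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


# Hypothetical group: 8 responses scored 0.25 and 8 scored 0.75.
rewards = [0.25] * 8 + [0.75] * 8
advantages = group_relative_advantages(rewards)
print(advantages[0])   # -1.0 for a response with reward 0.25
print(advantages[-1])  # 1.0 for a response with reward 0.75
```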

The advantages are then used to increase (or decrease) the probability of sampling those tokens again. Since the advantage in our example was negative, the tokens in that response become less likely: the model is penalized for performing below the group average.
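To make the direction of that update concrete, here is a toy sketch with a single three-token softmax "policy". It is a plain REINFORCE-style step, not the full training objective: real implementations typically add things like ratio clipping and a KL penalty, which are left out here. The function names, the learning rate, and the starting logits are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, sampled_token, advantage, lr=0.5):
    """One gradient-ascent step on advantage * log p(sampled_token)."""
    probs = softmax(logits)
    # d log p(sampled_token) / d logit_i = 1[i == sampled_token] - p_i
    grad = [(1.0 if i == sampled_token else 0.0) - p
            for i, p in enumerate(probs)]
    return [x + lr * advantage * g for x, g in zip(logits, grad)]


logits = [0.0, 0.0, 0.0]  # uniform toy policy over three tokens
before = softmax(logits)[1]

# Negative advantage: the sampled token becomes less likely.
after_negative = softmax(policy_gradient_step(logits, 1, -1.0))[1]
# Positive advantage: the sampled token becomes more likely.
after_positive = softmax(policy_gradient_step(logits, 1, 1.0))[1]

print(after_negative < before < after_positive)  # True
```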