Hacker News new | ask | show | jobs
by goodside 1042 days ago
You do, because it’s not just more training it’s PPO updates instead of MLE. It’s no longer trying to estimate the token distribution of the training corpus, it’s trying to shift logprobs into tokens that maximize expected reward from the RM. The GPT-4 technical report has a figure showing that logprobs become less well calibrated as confidence scores in the RLHF vs pre-train model.
1 comments

Fascinating, ty