| HN Mirror



	by goodside 1090 days ago
	You do, because it’s not just more training: it’s PPO updates instead of MLE. The model is no longer trying to estimate the token distribution of the training corpus; it’s trying to shift logprobs toward tokens that maximize expected reward from the RM. The GPT-4 technical report has a figure showing that logprobs are less well calibrated as confidence scores in the RLHF model than in the pre-trained model.
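
	To make "calibrated" concrete, here is a rough sketch of expected calibration error (ECE), one common way to score how well confidences match accuracy. The function name, the binning scheme, and the toy numbers are my own assumptions for illustration; they are not data or code from the report.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Assign each prediction to an equal-width confidence bin on [0, 1].
    bins = [[] for _ in range(n_bins)]
    for p, is_correct in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, is_correct))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece


# Toy, made-up data: both "models" answer 3 of 4 questions correctly.
outcomes = [True, True, True, False]

# A calibrated model reports 75% confidence when it is right 75% of the time.
calibrated = expected_calibration_error([0.75] * 4, outcomes)

# An overconfident model reports 95% confidence for the same accuracy.
overconfident = expected_calibration_error([0.95] * 4, outcomes)

print(f"calibrated ECE:    {calibrated:.2f}")
print(f"overconfident ECE: {overconfident:.2f}")
```

	Same accuracy in both cases; only the stated confidence differs, and that gap is what the metric picks up.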

1 comment

Fascinating, ty