| HN Mirror

pre-training is developing the language model's base understanding of conditional word probabilities.

SFT and RLHF is attempting to further guide the model in terms of steerability + alignment of output.

In fact, the InstructGPT authors were worried about losing the pre-trained model's underlying probability distribution, so they try a version where it penalizes the model deviating too significantly from the original distribution (using KL). I don't remember them seeing a significant difference in performance.