| The way they went from GPT-3 to ChatGPT is really quite genius. My understanding is that it's something like this: 1. Start with GPT-3, which predicts the next word in some text and is trained on all the text on the internet 2. Take thousands of prompts, generate several responses for each of them, and have human reviewers rank the responses for each prompt from best to worst 3. The GPT model needs a massive amount of training data, so it would be cost-prohibitive to get enough human feedback to fine-tune GPT manually. So you train another model, called the reward model, to predict how the humans will rate each response. Then you train the GPT model against the reward model millions of times 5. Feed a small percentage of the output from that training process back to the human reviewers to continue training the reward model, based on heuristics like reward model uncertainty, which predict how helpful the human feedback will be for improving the reward model 6. Release ChatGPT to the public, and use user feedback like response upvotes/downvotes to further optimize the reward model, while continuing to train ChatGPT against the reward model https://openai.com/blog/chatgpt/ https://openai.com/blog/deep-reinforcement-learning-from-hum... |
Step 2 is not quite that. The first human step is labellers manually writing the desired responses for a sample of prompts.
> A labeller demonstrates the desired output behavior.
(left side on https://cdn.openai.com/chatgpt/draft-20221129c/ChatGPT_Diagr...)
So this stage is supervised training. Ranking comes in the next stage, where it is used to train the reward model. The model fine-tuned in this first stage is not the reward model; it's the model that generates the sample responses, which humans then rank to produce the reward model's training data.
So there are two kinds of manual work involved here: demonstrating how to solve tasks, and ranking responses. There is even some discussion of how much effort to invest in the first vs. the second, and what the trade-off is.
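To make the difference between the two kinds of labelled data concrete, here is a rough Python sketch. It assumes the reward model is trained with a pairwise loss of the form -log(sigmoid(r_preferred - r_rejected)), which is my reading of the InstructGPT-style setup; the data, field names, and `toy_scores` are all made up for illustration, and no real model is involved.

```python
import math
from itertools import combinations

# Stage 1 data (hypothetical): a labeller writes the desired response.
# This is plain supervised fine-tuning data: prompt in, target text out.
demonstrations = [
    {"prompt": "Explain what a reward model is.",
     "response": "A model that scores how good a response is."},
]

# Stage 2 data (hypothetical): a labeller ranks sampled responses, best first.
rankings = [
    {"prompt": "Explain what a reward model is.",
     "ranked_responses": ["clear answer", "vague answer", "off-topic answer"]},
]


def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair
    # is always the one the labeller ranked higher.
    return list(combinations(ranked, 2))


def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(preferred - rejected)), written as log(1 + exp(-diff))."""
    diff = score_preferred - score_rejected
    return math.log1p(math.exp(-diff))


# Stand-in for a reward model's scalar outputs (assumed values, not real scores).
toy_scores = {"clear answer": 2.0, "vague answer": 0.5, "off-topic answer": -1.0}

pairs = ranking_to_pairs(rankings[0]["ranked_responses"])
losses = [pairwise_loss(toy_scores[w], toy_scores[l]) for w, l in pairs]
mean_loss = sum(losses) / len(losses)
print(f"{len(pairs)} pairs, mean pairwise loss = {mean_loss:.4f}")
```

The point is just that a demonstration gives the model a target to imitate, while a ranking only gives relative preferences, which is why the latter feeds a separate reward model rather than being used as direct training targets.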