by visarga
1284 days ago
> 2. Take thousands of prompts, generate several responses for each of them, and have human reviewers rank the responses for each prompt from best to worst

Step 2 is not that. It's manually writing responses for a few tasks.

> A labeller demonstrates the desired output behavior.

(left side of https://cdn.openai.com/chatgpt/draft-20221129c/ChatGPT_Diagr...)

So this stage is supervised training. Ranking is the next stage, and it is used to train the reward model. The model trained here is not the reward model; it is the model that generates the sample responses that labellers then rank to train the reward model.

So there are two kinds of manual work involved here: manually demonstrating how to solve tasks, and ranking responses. There is even discussion of how much effort to invest in the first versus the second, and what the trade-off is.
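
To make the distinction between the two kinds of labelled data concrete, here is a minimal Python sketch. The class names, field names, and the placeholder prompt are all made up for illustration and are not taken from OpenAI's code. Expanding one ranking into pairwise comparisons is one common way to feed a reward model, assumed here rather than confirmed by the diagram.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations


@dataclass
class Demonstration:
    """Supervised stage: a labeller writes the desired response by hand."""
    prompt: str
    response: str


@dataclass
class Ranking:
    """Ranking stage: sampled responses, ordered best to worst by a labeller."""
    prompt: str
    responses_best_to_worst: list[str]


def ranking_to_pairs(r: Ranking) -> list[tuple[str, str, str]]:
    """Expand one ranking into (prompt, preferred, rejected) comparisons.

    combinations() keeps the input order, so the first element of each
    pair is always the one the labeller ranked higher.
    """
    return [
        (r.prompt, better, worse)
        for better, worse in combinations(r.responses_best_to_worst, 2)
    ]


# Hypothetical data: one hand-written demonstration, one ranking of 3 samples.
demo = Demonstration(prompt="<some prompt>", response="<labeller-written answer>")
ranking = Ranking(prompt="<some prompt>", responses_best_to_worst=["A", "B", "C"])
pairs = ranking_to_pairs(ranking)
```

A demonstration costs one written answer per prompt, while a single ranking of K responses yields K*(K-1)/2 comparisons, which is part of why the effort trade-off between the two comes up.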