by visarga 1284 days ago
> 2. Take thousands of prompts, generate several responses for each of them, and have human reviewers rank the responses for each prompt from best to worst

Step 2 is not that. It's manually writing responses for a few tasks.

> A labeller demonstrates the desired output behavior.

(left side on https://cdn.openai.com/chatgpt/draft-20221129c/ChatGPT_Diagr...)

So this stage is supervised training. Ranking is the next stage, for training the reward model. This is not the reward model; it's a model that generates the sample responses to be used by the reward model.

So there are two kinds of manual work involved here: manually demonstrating how to solve tasks, and ranking responses. There is even talk about how much effort to invest in the first vs the second, and what the trade-off is.
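
To make the two kinds of labelling concrete, here is a rough sketch of what each one produces. The field names, the placeholder strings, and the `ranking_to_pairs` helper are all made up for illustration; none of this is taken from OpenAI's code. Expanding a ranking into pairwise preferences is one common way to turn it into reward-model training data, and I'm assuming that here.

```python
from itertools import combinations

# Hypothetical data shapes, for illustration only.

# Kind 1: a labeller writes the desired response by hand.
# These (prompt, response) pairs feed ordinary supervised training.
demonstrations = [
    {"prompt": "example prompt", "response": "response written by a labeller"},
]

# Kind 2: a labeller ranks several sampled responses, best first.
# These rankings feed the reward model.
ranking = {
    "prompt": "example prompt",
    "ranked_responses": ["A", "B", "C", "D"],  # best to worst
}


def ranking_to_pairs(ranked):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first item of every pair
    is always ranked above the second.
    """
    return list(combinations(ranked, 2))


pairs = ranking_to_pairs(ranking["ranked_responses"])
```

One ranking of 4 responses yields 6 comparisons, while one demonstration yields exactly one training example, which is part of why the effort trade-off between the two is worth thinking about.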

1 comment

Right, I intentionally left Step 1 off that chart to simplify the explanation, since it didn't seem necessary. Is Step 1 just for creating the ChatGPT content blocker?