by sebzim4500
1208 days ago
>You really don't throw two sentences into the thunder dome to decide which one "wins".

That's almost literally what RLHF is, though, and it is the last step of training GPT-n. When GPT-{n+1} is trained, its training data will include some outputs from GPT-n, so it benefits from that fine-tuning even before it goes through its own round of RLHF.

Also, on average, good outputs of GPT-n are more likely to end up in the training set of GPT-{n+1} (because they become a BuzzFeed article or a top post on Reddit or something), so there is an additional selection signal on top of that.
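
For concreteness, the "thunder dome" step can be sketched as the pairwise preference loss commonly used to train an RLHF reward model: two candidate outputs are scored, and the loss is small when the human-preferred one scores higher. This is only a minimal illustration assuming scalar reward scores; the function name `pairwise_preference_loss` is made up here, and this is not the actual training code of any GPT model.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper for illustration; the rewards are assumed to be
    scalar scores that a reward model gave to two candidate outputs.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The preferred output "wins" when it scores higher, which gives a lower loss.
loss_when_ranked_correctly = pairwise_preference_loss(2.0, 0.0)
loss_when_ranked_wrongly = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many human-labeled pairs pushes the reward model to score the winning output above the losing one, and the policy is then tuned against that reward model.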