by sebzim4500 1208 days ago
>You really don't throw two sentences into the thunder dome to decide which one "wins".

That's almost literally what RLHF is though, and that is the last step of training GPT-n. Then when GPT-{n+1} is being trained, its training data will include some outputs from GPT-n, so it will benefit from that finetuning even before it goes through its own round of RLHF. Also, on average, good outputs of GPT-n are more likely to be included in the training set of GPT-{n+1} (because they end up as a BuzzFeed article or a top post on Reddit or something), so there is an additional signal beyond the above.
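
To make the "two outputs go in, one wins" comparison concrete, here is a minimal sketch of the pairwise preference loss commonly used when training an RLHF reward model (a Bradley-Terry style objective). The function name and the example scores are illustrative assumptions, not taken from any particular library or from how any specific GPT model was trained.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    output higher than the rejected one, and large when it does not.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for two candidate completions.
agrees_with_human = pairwise_preference_loss(2.0, 0.0)
disagrees_with_human = pairwise_preference_loss(0.0, 2.0)
```

Note that the "winner" label here comes from a human rater, not from the outputs themselves; the loss only pushes the reward model toward agreeing with that rater.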

2 comments

I suspect the comment about the thunder dome was a reference to RLHF. On the one hand, RLHF seems far superior to the kind of prompt engineering Microsoft seems to have relied on with Sydney. On the other, it's doubtful that the manual selection in RLHF always selects for quality, rather than pandering, at least to a significant extent, to whatever biases or preferences the humans in the training loop happen to have.
That's not what RLHF is. In the thunderdome, as in chess, you don't need human judges or an oracle to know who has won. That makes a significant difference to the training procedure.
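
A rough sketch of that difference, using tic-tac-toe as a stand-in for chess: the reward is computed from the rules of the game alone, with no human label anywhere. The names `winner` and `self_play_reward` are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence

# All eight winning lines on a 3x3 board, indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board: Sequence[str]) -> Optional[str]:
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_reward(board: Sequence[str], player: str) -> float:
    """Reward decided purely by the game rules: +1 win, -1 loss, 0 otherwise."""
    w = winner(board)
    if w is None:
        return 0.0
    return 1.0 if w == player else -1.0

# X has completed the top row.
final_board = list("XXXOO    ")
```

In RLHF the equivalent of `winner` has to be supplied by human raters (or a model trained to imitate them), which is exactly where their biases and preferences can enter.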