| HN Mirror

No, not really. OA has been reticent to publish any real details about what RLHF GPT-4 and later models go through; while some models have been much more open, those weren't used in OP.

And it's unclear how easily you can interrogate their code/data to understand exactly how the RLHF goes wrong here - it seems unlikely that there are all that many raters rewarding conversations with heads rather than tails in hypothetical coinflips, so it's probably a more subtle issue of entropy collapse. (It's not that easy to understand why DL stuff does the stuff it does, and it's even more true that when it comes to RL stuff, it's much easier to observe outcomes than to understand how exactly the RL process yielded that outcome.)

So, we can see the effects before/after very clear in the OA Figure 8 graph in https://arxiv.org/pdf/2303.08774.pdf#page=12&org=openai on calibration, but I dunno if even they could tell you what exactly about the raters or PPO hyperparameters or whatever causes that.