by gwern 784 days ago
> If you trained on specific generated data with real distributions

It was trained on generated data from real distributions! The datasets LLMs are trained on include gigabytes of real data from real distributions, in addition to all of the code/stats/etc samples.

The question you should be asking is 'why did it stop being able to predict real distributions?' And we already know the answer: RLHF. https://news.ycombinator.com/item?id=40227082


Do we know in any detail who provided the RLHF and according to what rules for any of these models?
No, not really. OA has been reluctant to publish any real details about the RLHF that GPT-4 and later models go through; some other models have been much more open about theirs, but those weren't used in OP.

And it's unclear how easily you can interrogate their code/data to understand exactly how the RLHF goes wrong here - it seems unlikely that there are all that many raters rewarding conversations with heads rather than tails in hypothetical coinflips, so it's probably a more subtle issue of entropy collapse. (It's not easy to understand why DL models do what they do, and that goes double for RL: it's much easier to observe outcomes than to understand how exactly the RL process yielded them.)
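
To make 'entropy collapse' concrete, here is a minimal sketch of how you might measure it from the outside: ask a model for many coinflips and compute the empirical entropy of the answers. The sample lists below are made-up illustrations, not outputs from any real model, and `empirical_entropy_bits` is a hypothetical helper name.

```python
import math
from collections import Counter

def empirical_entropy_bits(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    # Sum p * log2(1/p) over observed outcomes; a fair coin gives 1 bit.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Made-up illustrative data, NOT real model outputs.
fair_flips = ["H", "T"] * 50                 # what a calibrated sampler would look like
collapsed_flips = ["H"] * 90 + ["T"] * 10    # what a collapsed sampler might look like

print(empirical_entropy_bits(fair_flips))       # 1 bit for a 50/50 split
print(empirical_entropy_bits(collapsed_flips))  # well under 1 bit
```

This only tells you that the output distribution has narrowed, not why; it says nothing about which part of the RLHF pipeline caused it.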

So, we can see the effects before/after very clearly in the OA Figure 8 graph in https://arxiv.org/pdf/2303.08774.pdf#page=12&org=openai on calibration, but I dunno if even they could tell you what exactly about the raters or PPO hyperparameters or whatever causes that.
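
For anyone who wants to put a number on 'calibration': one common summary is expected calibration error (ECE), which bins predictions by stated confidence and compares each bin's average confidence to its actual accuracy. This is a rough sketch of that generic metric, not necessarily what was computed for the linked figure; the function name and the toy inputs are my own assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, ok))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Toy inputs, NOT real model data: both "models" claim 90% confidence.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(calibrated, overconfident)
```

A well-calibrated model scores near 0; a model that says 90% but is right half the time scores around 0.4. Like the before/after graph, this shows the outcome and not the cause.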