Hacker News
by robertclaus 784 days ago
I wonder if you could actually fine tune an LLM to do better on this. As some of the comments point out, the issue here is that the possible output probabilities combined with the model temperature don't actually result in the probabilities requested in the prompt. If you trained on specific generated data with real distributions would it learn to compensate appropriately? Would that carry over to novel probability prompts?
3 comments

Almost certainly not if you set the temperature of the model to 0, since then the output would be deterministic minus MoE stuff.

If the temperature were not zero, then it seems technically possible for the output tokens to be weighted closely enough in probability to each other that the randomization from temperature causes tokens to be printed in the appropriate distribution.

However, I'm not an LLM expert, but as far as I know people don't use a "temperature" while training the model. Thus the training step would not be able to learn how to output tokens in the given distribution at a given temperature, because the training step does not have access to the temperature the user is using.
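
To make the temperature argument above concrete, here is a minimal sketch in plain Python. It assumes the usual sampling setup in which logits are divided by the temperature before a softmax. The function names `softmax_with_temperature` and `logits_for_target` are made up for this illustration and do not come from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Sampling probabilities for `logits` at a temperature > 0 (assumed setup)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; temperature 0 means greedy argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for_target(target_probs, temperature):
    """Hypothetical helper: logits that reproduce `target_probs` at `temperature`."""
    return [temperature * math.log(p) for p in target_probs]

# The model "wants" 70% heads / 30% tails and shapes its logits for T = 0.5.
logits = logits_for_target([0.7, 0.3], temperature=0.5)

matched = softmax_with_temperature(logits, temperature=0.5)     # recovers 0.7 / 0.3
mismatched = softmax_with_temperature(logits, temperature=1.0)  # drifts toward uniform
print(matched, mismatched)
```

The same logits give the requested split only at the temperature they were shaped for, which is the sense in which a model that cannot see the user's temperature cannot compensate for it.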

EDIT: I made the assumption that the LLM was not asked for a sequence of random numbers, but only one number per prompt. I think this fits the use case described in the article, but another use case might be asking for a sequence of such numbers, in which case training might work.

> If you trained on specific generated data with real distributions

It was trained on generated data from real distributions! The datasets LLMs are trained on include gigabytes of real data from real distributions, in addition to all of the code/stats/etc samples.

The question you should be asking is 'why did it stop being able to predict real distributions?' And we already know the answer: RLHF. https://news.ycombinator.com/item?id=40227082

Do we know in any detail who provided the RLHF and according to what rules for any of these models?

No, not really. OA has been reticent to publish any real details about what RLHF GPT-4 and later models go through; while some models have been much more open, those weren't used in OP.

And it's unclear how easily you can interrogate their code/data to understand exactly how the RLHF goes wrong here - it seems unlikely that there are all that many raters rewarding conversations with heads rather than tails in hypothetical coinflips, so it's probably a more subtle issue of entropy collapse. (It's not that easy to understand why DL stuff does the stuff it does, and it's even more true that when it comes to RL stuff, it's much easier to observe outcomes than to understand how exactly the RL process yielded that outcome.)

So, we can see the effects before/after very clearly in the OA Figure 8 graph in https://arxiv.org/pdf/2303.08774.pdf#page=12&org=openai on calibration, but I dunno if even they could tell you what exactly about the raters or PPO hyperparameters or whatever causes that.

Probably yes. You could also garnish the prompt with a vanilla RNG output.
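
A rough sketch of what garnishing the prompt with RNG output might look like, in plain Python. The prompt wording and both helper names (`build_prompt`, `pick_from_uniform`) are assumptions made for this illustration. The second function only shows the deterministic mapping the model would be expected to carry out once the random number is in its context.

```python
import random

def build_prompt(question, rng=None):
    """Prepend a uniform random number to the question (hypothetical prompt format)."""
    rng = rng or random.Random()
    u = rng.random()  # uniform in [0, 1)
    prompt = f"Random value (uniform in [0, 1)): {u:.6f}\n{question}"
    return prompt, u

def pick_from_uniform(u, outcome_probs):
    """Map a uniform draw to an outcome by walking the cumulative distribution."""
    cumulative = 0.0
    outcome = None
    for outcome, p in outcome_probs.items():
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome  # guard against floating-point rounding at the top end

prompt, u = build_prompt("Answer 'heads' with probability 0.7, otherwise 'tails'.")
print(prompt)
print(pick_from_uniform(u, {"heads": 0.7, "tails": 0.3}))
```

With the random number supplied in the prompt, the task becomes a deterministic lookup, so it no longer depends on how the sampler's temperature reshapes the token probabilities.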