by ylow
771 days ago
Indeed, this is unsurprising given how LLMs work. I mean, if you ask a human to generate a random number, then reset the universe and all state of the human and ask again, you will get the same number. But if I instead ask it to generate 100 samples, it actually works pretty well. "You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Generate 100 samples of either "left" or "right". Do not say anything else." I got 71 left and 27 right. And if I ask for 50%/50%, I get 56 lefts and 44 rights.
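For anyone who wants to check a batch like this themselves, here is a rough Python sketch of the tallying step plus a simple binomial z-score against the requested probability. The helper names `tally` and `z_score` are made up for illustration, and the z-score treats the samples as independent draws, which is an assumption that output generated in a single response may not satisfy.

```python
import math
from collections import Counter

def tally(samples_text, labels=("left", "right")):
    """Count how often each label appears in whitespace-separated model output."""
    words = [w.strip(".,\"'").lower() for w in samples_text.split()]
    counts = Counter(w for w in words if w in labels)
    return {label: counts[label] for label in labels}

def z_score(successes, total, target_p):
    """Distance of the observed count from the target, in binomial standard deviations.

    Assumes independent draws, which is only an approximation here.
    """
    expected = total * target_p
    std = math.sqrt(total * target_p * (1 - target_p))
    return (successes - expected) / std

# Counts reported in the comment above (80/20 prompt, then 50/50 prompt).
for counts, target in (({"left": 71, "right": 27}, 0.8),
                       ({"left": 56, "right": 44}, 0.5)):
    total = sum(counts.values())
    share = counts["left"] / total
    print(f"target={target:.2f} observed={share:.3f} n={total} "
          f"z={z_score(counts['left'], total, target):+.2f}")
```

Note that the counts reported for the 80/20 prompt sum to 98 rather than 100, so the sketch uses the total actually tallied instead of assuming 100.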
|
It actually is surprising, and you should be surprised rather than post hoc justifying it, because the logits should reflect the true random probability and be calibrated in order to minimize the prediction loss. Putting ~100% of the weight on 'heads' is a terrible prediction!
And the LLM logits are in fact calibrated... before they go through RLHF and RLHF-derived dataset training. (Note that all of the models OP lists are either tuned, non-base models like ChatGPT, or trained on data from such models, like Phi.) This was observed qualitatively when the 3.5 models were first released to the Playground, documented by the GPT-4 paper, and the 'flattened logits' phenomenon has been found many times since, not just by OP, and mostly by people totally ignorant of this phenomenon (despite its being quite well known).
This is just one of those things, like BPE-related errors, that we're doomed to point out again and again in the Eternal September of LLMs.
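To make the calibration argument above concrete, here is a small Python sketch of the expected log loss for a binary outcome. It is only an illustration of why a loss-minimizing predictor should be calibrated, not a measurement of any actual model; the function name `expected_log_loss` and the chosen probabilities are made up for the example.

```python
import math

def expected_log_loss(p_true, q_pred):
    """Expected cross-entropy (in nats) when the outcome has probability p_true
    and the predictor assigns it probability q_pred."""
    return -(p_true * math.log(q_pred) + (1 - p_true) * math.log(1 - q_pred))

# A fair coin: compare a calibrated prediction with increasingly overconfident ones.
p = 0.5
for q in (0.5, 0.8, 0.99, 0.999):
    print(f"p={p} q={q}: expected loss = {expected_log_loss(p, q):.3f} nats")

# On a coarse grid of candidate predictions, the loss is lowest at q == p.
grid = [i / 100 for i in range(1, 100)]
best_q = min(grid, key=lambda q: expected_log_loss(0.8, q))
print("best q for p=0.8:", best_q)
```

The loss grows without bound as the prediction approaches 0 or 1 while the true probability stays at 0.5, which is the sense in which putting nearly all the weight on one outcome is a terrible prediction.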