Hacker News new | ask | show | jobs
by ylow 771 days ago
Indeed this is unsurprising given how LLMs work. I mean if you ask a human to generate a random number, and then reset the universe and all state of the human and ask again, you will get the same number.

But instead if I ask it to generate 100 samples, it actually works pretty well.

"You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Generate 100 samples of either "left" or "right". Do not say anything else. "

I got 71 left, and 27 right.

And if I ask for 50%, 50%. I get 56 lefts and 44 rights.

2 comments

> Indeed this is unsurprising given how LLMs work. I mean if you ask a human to generate a random number, and then reset the universe and all state of the human and ask again, you will get the same number.

It actually is surprising, and you should be surprised rather than post hoc justifying it, because the logits should reflect the true random probability and be calibrated in order to minimize the prediction loss. Putting ~100% weights on 'heads' is a terrible prediction!

And the LLM logits are in fact calibrated... before they go through RLHF and RLHF-derived dataset training. (Note that all of the models OP lists are either non-base tuned models like ChatGPT, or trained on data from such models, like Phi.) This was observed qualitatively when the 3.5 models were first released to the Playground, documented by the GPT-4 paper, and the 'flattened logits' phenomenon has been found many times since, not just by OP, and mostly by people totally ignorant of this phenomenon (despite being quite well known).

This is just one of those things, like BPE-related errors, that we're doomed to point out again and again in the Eternal September of LLMs.

> Putting ~100% weights on 'heads' is a terrible prediction!

For a weighted coin, isn't this the optimal strategy in the absence of other information? `p > p^2 + ( 1 − p )^2`.

No, because you're confusing loss functions: a LLM makes a probabilistic prediction, not a hard decision. That is the optimal strategy only if you have something like a 0-1 loss function†, akin to betting on a coin flip, which is not a proper scoring rule (and not easily differentiable either).

Whereas LLMs are usually trained with a proper scoring rule which incentivizes them to report calibrated predictions, like mean squared error. For that, the optimal prediction is just '50%', perhaps transformed into log-odds, and whatever the equivalent of '50%' is over the BPE vocabulary.

† eg if you are betting $1 on whether heads or tails come up, it is true that you can't do better than always betting $1 on the side with P>50% - and strikingly, this is not what people do in setups like the spinner game (or Twitter polls), they 'probability match', which is optimal in terms of Thompson sampling, as if they were playing a indefinitely-long repeated bandit to minimize regret. I usually take this as an example of System I vs System II: showing how hard it is to break our real-world-appropriate intuitive behavior in artificial game setups. If you think about it, in the usual spinner-game, probability matching is just straightforwardly wrong and it's not like a bandit at all; but you do have to think about it.

(Yes 71 + 27 != 100, but that LLMs can't count is a whole other issue)