| >You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Simply reply with left or right. Do not say anything else I could have told you these results solely based on the methodology combined with this system prompt. No need to spend money on APIs. Randomness in LLMs does not come from the context, it comes from sampling over output tokens the LLM considers likely. Imagine you are in this situation as a human: Someone walks up to you and tells you to say "left" with 80% probability and "right" with 20% probability. You say "left" and then the other person walks away never to be seen again. How do you determine if your own "output" was correct? You would need to sample it many times in the same conversation before anyone could determine whether you understand the basics of probability or not. This is an issue of the author's understanding of Bayesian statistics and possibly a misunderstanding of how LLMs actually work. Edit: I just tried a minimally more sensible approach after getting an idea from the comments below. I asked GPT4 to generate a random number using this prompt: >You are a random number generator. Reply with a number between 0 and 10. Only say the number, say nothing else. It responded with 7. But then I looked at the top logprobs. Sure enough, they contained all the remaining numbers between 0 and 10. The only issue is that "7" got a logprob of -0.008539278, while the next most likely was "4" at -5.5371723, which is significantly lower. The remaining probs were then pretty close to each other. Unfortunately, OpenAI doesn't allow you to crank the temperature up arbitrarily high, otherwise the original experiment would actually work. And I would argue that humans will still fail at this if you used the same methodology. 
The reason I didn't use OP's exact approach is that if you look at the logprobs there, you'll see they get muddled with tokens that are just different spellings of left and right (such as "Left" or "-left"). But the model definitely understands the concept of probability, it would just need more context before you can do any reasonable frequentist analysis in a single conversation. Edit 2: I repeated it with random numbers between 0 and 100. Guess what numbers are coming out among the top logprobs. Pretty much exactly what you'd expect after watching this: https://www.youtube.com/watch?v=d6iQrh2TK98 I guess LLMs trained on human data think pretty similarly to humans after all. |
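To put the two logprobs quoted above into more familiar terms, here is a rough sketch that converts them to probabilities and shows what raising the temperature would do. It assumes the reported logprobs are natural logs taken at temperature 1, and it renormalises over just the two quoted tokens because the other values were not given; `to_probs` and `rescale` are hypothetical helper names, not part of any API.

```python
import math

# The two logprobs quoted in the comment (assumed: natural log, temperature 1).
top_logprobs = {"7": -0.008539278, "4": -5.5371723}


def to_probs(logprobs):
    """Convert log-probabilities to plain probabilities."""
    return {tok: math.exp(lp) for tok, lp in logprobs.items()}


def rescale(logprobs, temperature):
    """Approximate the distribution at another temperature.

    Divides each logprob by the temperature and renormalises over the
    tokens supplied. This ignores every token not in the dict, so it is
    only an illustration, not what the API would actually return.
    """
    scaled = {tok: lp / temperature for tok, lp in logprobs.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


probs = to_probs(top_logprobs)
for temperature in (1.0, 2.0, 100.0):
    print(temperature, rescale(top_logprobs, temperature))
```

At temperature 1 the quoted values work out to roughly 0.991 for "7" and 0.004 for "4". Under these assumptions the gap only closes toward 50/50 at very high temperatures, which is the knob the comment says the API does not expose far enough.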
I think the author's point stands. They aren't asking "what would you expect from a distribution so described?" The answer to that question is 100% of the time "left". A well-behaved LLM responding to the actual question should distribute the logits across "left" and "right" in the way requested by the user, and it doesn't.
I think if you chose 1000 random people and prompted them with this question you would get more "lefts" than the prompt asks for, but not 100% left.
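For what "distributing the logits in the way requested" would mean concretely, here is a small sketch. It assumes a toy two-token vocabulary and plain softmax sampling at temperature 1; it simulates a sampler and calls no real model. An 80/20 split corresponds to a logit gap of ln(0.8/0.2) = ln 4, and drawing 1000 independent samples (one per fresh conversation, or per random person) is how you would check it.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_many(probs, n, seed=0):
    """Draw n independent tokens from probs and count the outcomes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return Counter(rng.choices(tokens, weights=weights, k=n))


# Hypothetical logits: a gap of ln 4 between the two tokens gives 80/20.
logits = {"left": math.log(4.0), "right": 0.0}
probs = softmax(logits)
counts = sample_many(probs, 1000)

print(probs)
print(counts)
```

A model that puts nearly all of its mass on "left" would show up here as a logit gap far larger than ln 4, and the counts over many samples would make that visible in a way a single reply cannot.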