by sigmoid10 784 days ago
>You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Simply reply with left or right. Do not say anything else

I could have told you these results solely based on the methodology combined with this system prompt. No need to spend money on APIs. Randomness in LLMs does not come from the context, it comes from sampling over output tokens the LLM considers likely. Imagine you are in this situation as a human: Someone walks up to you and tells you to say "left" with 80% probability and "right" with 20% probability. You say "left" and then the other person walks away never to be seen again. How do you determine if your own "output" was correct? You would need to sample it many times in the same conversation before anyone could determine whether you understand the basics of probability or not. This is an issue of the author's understanding of Bayesian statistics and possibly a misunderstanding of how LLMs actually work.

Edit:

I just tried a minimally more sensible approach after getting an idea from the comments below. I asked GPT4 to generate a random number using this prompt:

>You are a random number generator. Reply with a number between 0 and 10. Only say the number, say nothing else.

It responded with 7. But then I looked at the top logprobs. Sure enough, they contained all the remaining numbers between 0 and 10. The only issue is that "7" got a logprob of -0.008539278, while the next most likely was "4" at -5.5371723, which is significantly lower. The remaining probs were then pretty close to each other. Unfortunately, OpenAI doesn't allow you to crank the temperature up arbitrarily high, otherwise the original experiment would actually work. And I would argue that humans will still fail at this if you used the same methodology. The reason I didn't use OP's exact approach is that if you look at the logprobs there, you'll see they get muddled with tokens that are just different spellings of left and right (such as "Left" or "-left"). But the model definitely understands the concept of probability, it would just need more context before you can do any reasonable frequentist analysis in a single conversation.
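
For a sense of scale, those two logprobs can be turned back into probabilities. This is only a minimal sketch, assuming the API reports natural-log probabilities; the dictionary just holds the two values quoted above, and the variable names are illustrative.

```python
import math

# The two logprobs quoted above; assumed to be natural logs.
top_logprobs = {"7": -0.008539278, "4": -5.5371723}

# exp() inverts the log, giving each token's probability.
probs = {token: math.exp(lp) for token, lp in top_logprobs.items()}

# Whatever is left over is shared by every other candidate token.
remaining_mass = 1.0 - sum(probs.values())

for token, p in probs.items():
    print(f"{token!r}: {p:.4f}")
print(f"everything else: {remaining_mass:.4f}")
```

That works out to roughly 99% for "7" and about 0.4% for "4", so when sampling at temperature 1 you would see something other than 7 less than once in a hundred tries.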

Edit 2:

I repeated it with random numbers between 0 and 100. Guess what numbers are coming out among the top logprobs. Pretty much exactly what you'd expect after watching this: https://www.youtube.com/watch?v=d6iQrh2TK98

I guess LLMs trained on human data think pretty similarly to humans after all.


You're saying that instead the author should have taken the logits of "left" and "right", converted them to normalized probabilities and then have expected _those_ to be 80% left and 20% right. But if this were the case (under some reasonable assumptions about the sampling methodology of the providers) then the author would have seen an 80/20 split. From these results we can probably conclude that with this prompt the predicted probability for "left" is near 100% for GPT4.

I think the author's point stands. They aren't asking "what would you expect from a distribution so described?" The answer to that question is 100% of the time "left". A well behaving LLM responding to the actual question should distribute the logits across "left" and "right" in the way requested by the user and doesn't.

I think if you chose 1000 random people and prompted them with this question you would get a preponderance of "lefts" compared to the prompt, but not 100% left.

>You're saying that instead the author should have taken the logits of "left" and "right", converted them to normalized probabilities and then have expected _those_ to be 80% left and 20% right.

No, that's not what I meant. Although it would still make more sense than what the author did. The problem lies in the way you actually determine probabilities. We know that humans are bad random number generators, but they understand the concept enough to come up with random-looking stuff if you give them the chance. The LLMs here were not even given a chance. In essence, the author is complaining that the LLMs are not behaving according to frequentist statistics when he evaluates them in a strictly Bayesian setting.

I don't agree: a Bayesian statistician posed the question "You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’ [...]" would say "left" 80% of the time and "right" 20% of the time. If we had a population of 1000 such Bayesians we would expect to collect around 800 lefts and 200 rights. If we asked the same Bayesian 1000 times we'd expect the same. It's got nothing to do with Bayesian vs Frequentist statistics.

Real humans probably would say left more often than 80% of the time, which is what I guess you're getting at, but the question is very clearly asking the subject to "sample from" (an entirely Bayesian activity) a distribution, not to give the expected value. GPT4 gives the expected value and this is simply wrong.
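
To make the distinction concrete, here is a rough sketch of the two behaviours: drawing from the requested 80/20 distribution versus always returning the single most likely answer. The function names and the fixed seed are my own choices for illustration, not anything from the original experiment.

```python
import random
from collections import Counter

CHOICES = ["left", "right"]
WEIGHTS = [0.8, 0.2]  # the split requested in the prompt

def sample_answer(rng: random.Random) -> str:
    """Draw one answer from the requested 80/20 distribution."""
    return rng.choices(CHOICES, weights=WEIGHTS, k=1)[0]

def most_likely_answer() -> str:
    """Always return the single most probable answer."""
    return max(zip(WEIGHTS, CHOICES))[1]

rng = random.Random(0)  # fixed seed so the run is repeatable
sampled = Counter(sample_answer(rng) for _ in range(1000))
greedy = Counter(most_likely_answer() for _ in range(1000))

print("sampled:", dict(sampled))
print("greedy: ", dict(greedy))
```

The sampled counts land near 800/200, while the second strategy gives 1000 lefts, which is the pattern reported for GPT4.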

>GPT4 gives the expected value and this is simply wrong.

Only at T=0. See my edit above for how this changes everything.

This doesn't really have anything to do with the language model. The temperature only has to do with the _sampling_ from the probability distribution which the language model predicts. In fact, raising the temperature would eventually cause the model to randomly print "left" or "right" (eventually at a 50/50 chance), not converge on the actual distribution which the prompt suggests. I suppose if you restricted the logits to just those tokens "left" and "right", softmaxed them, and then tuned the temperature T you might get it to reproduce the correct distribution, but that would be true of a random language model as well.
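
A small sketch of that argument, assuming the usual temperature scheme where logits are divided by T before the softmax. The logits below are made up; only the gap between them matters.

```python
import math

def left_probability(logit_left: float, logit_right: float, temperature: float) -> float:
    """Softmax restricted to two tokens, with logits divided by the temperature."""
    # For two tokens the softmax reduces to a sigmoid of the scaled logit gap.
    gap = (logit_left - logit_right) / temperature
    return 1.0 / (1.0 + math.exp(-gap))

def temperature_for_target(logit_left: float, logit_right: float, target_left: float) -> float:
    """Solve left_probability(...) == target_left for the temperature."""
    # Requires logit_left > logit_right and 0.5 < target_left < 1.
    return (logit_left - logit_right) / math.log(target_left / (1.0 - target_left))

# Hypothetical logits, chosen only so that "left" dominates at T=1.
LOGIT_LEFT, LOGIT_RIGHT = 10.0, 4.0

for t in (1.0, 2.0, 10.0, 100.0):
    print(f"T={t:>5}: P(left)={left_probability(LOGIT_LEFT, LOGIT_RIGHT, t):.3f}")

tuned_t = temperature_for_target(LOGIT_LEFT, LOGIT_RIGHT, 0.8)
print(f"T={tuned_t:.3f} gives P(left)={left_probability(LOGIT_LEFT, LOGIT_RIGHT, tuned_t):.3f}")
```

P(left) drifts toward 0.5 as T grows, and some T hits 0.8 for any pair of logits where "left" is ahead, which is why tuning T says nothing about whether the model understood the prompt.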

I think it's pretty simple and straightforward: the model simply fails to understand the question and can reasonably be said to not understand probability.

That's just not true. At least not more or less than when performing the same experiment on humans.
This matches my understanding, thanks. I thought I was going crazy reading other comments.
> We know that humans are bad random number generators

This is a good point. LLMs are bad at this, okay, but humans aren't great at it either.

But according to this GPT4 is substantially worse.
Yes, probably. At temperature zero the model will be completely deterministic, so a particular prompt will always produce the same result (ignoring for a second that some fairly common optimisations introduce data races in the GPU).

On the other hand, does it really matter? With a slight tweak to the prompt, ChatGPT generates some serviceable code:

    > Run a function to produce a random number between 1 and 10. What is the number?

    import random

    # Generate a random number between 1 and 10
    random_number = random.randint(1, 10)
    random_number

    The random number generated between 1 and 10 is 9.
> (ignoring for a second that some fairly common optimisations introduce data races in the GPU).

Okay so are any GPU compilers intentionally introducing data races in programs that previously exhibited no data races?

> A well behaving LLM responding to the actual question should distribute the logits across "left" and "right" in the way requested by the user and doesn't.

No, a well-behaving LLM would do exactly what's seen. The most likely next token is "left" and it should deterministically output that unless some other layer like a temperature function makes it non-deterministic in its own way (wholly unrelated to the prompt).

The fantastical AGI precursor that people have been coached into seeing is what you're talking about, and that's (of course) not what an LLM actually is.

This is essentially just one of the easier ways you can expose the parlor trick behind that misconception.

This simply doesn't follow. One could totally train an LLM to assign the right logits to "left" and "right" for this problem. I suspect it's a problem with the training data.
> Randomness in LLMs does not come from the context, it comes from sampling over output tokens the LLM considers likely.

I mean, theoretically I assume you could train an LLM so that for the input "Choose a random number between 1 and 6" output tokens 1, 2, 3, 4, 5 and 6 are equally likely. Then the sampling process would produce a random number.
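
As a toy illustration of that idea (not a real model): if the logits for the six tokens came out equal, ordinary softmax sampling would behave like a fair die. The logit values, names, and seed here are all hypothetical.

```python
import math
import random
from collections import Counter

TOKENS = ["1", "2", "3", "4", "5", "6"]
# Hypothetical logits from an idealised model: all equal, so no face is favoured.
LOGITS = [2.0] * len(TOKENS)

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

token_probs = softmax(LOGITS)

rng = random.Random(0)  # fixed seed so the run is repeatable
rolls = Counter(rng.choices(TOKENS, weights=token_probs, k=6000))

for token in TOKENS:
    print(token, rolls[token])
```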

Of course, whether you could teach the model to generalise that more broadly is a different matter.