Hacker News
by nathan_compton 784 days ago
You're saying that instead the author should have taken the logits of "left" and "right", converted them to normalized probabilities and then have expected _those_ to be 80% left and 20% right. But if this were the case (under some reasonable assumptions about the sampling methodology of the providers) then the author would have seen an 80/20 split. From these results we can probably conclude that with this prompt the predicted probability for "left" is near 100% for GPT4.
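
To make the "logits to normalized probabilities" step concrete, here is a minimal sketch. The logit values are made up for illustration (nothing in the article reports actual GPT4 logits), and `two_token_probs` is a hypothetical helper, not any provider's API.

```python
import math

def two_token_probs(logit_left, logit_right):
    """Softmax restricted to the two tokens of interest."""
    m = max(logit_left, logit_right)  # subtract the max for numerical stability
    e_left = math.exp(logit_left - m)
    e_right = math.exp(logit_right - m)
    total = e_left + e_right
    return {"left": e_left / total, "right": e_right / total}

# Hypothetical logits whose gap is ln(4), which yields an 80/20 split.
probs = two_token_probs(math.log(4.0), 0.0)

# Hypothetical logits with a large gap, the kind of shape the observed
# near-100% "left" results would suggest.
peaked = two_token_probs(10.0, 0.0)

print(probs)
print(peaked)
```

An 80/20 split only needs a logit gap of ln(4), roughly 1.39; a gap of 10 already puts "left" above 99.99%.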

I think the author's point stands. They aren't asking "what would you expect from a distribution so described?" The answer to that question is 100% of the time "left." A well behaving LLM responding to the actual question should distribute the logits across "left" and "right" in the way requested by the user and doesn't.

I think if you chose 1000 random people and prompted them with this question you would get a preponderance of "lefts" compared to the prompt, but not 100% left.

2 comments

>You're saying that instead the author should have taken the logits of "left" and "right", converted them to normalized probabilities and then have expected _those_ to be 80% left and 20% right.

No, that's not what I meant. Although it would still make more sense than what the author did. The problem lies in the way you actually determine probabilities. We know that humans are bad random number generators, but they understand the concept well enough to come up with random-looking stuff if you give them the chance. The LLMs here were not even given a chance. In essence, the author is complaining that the LLMs are not behaving according to frequentist statistics when he evaluates them in a strictly Bayesian setting.

I don't agree: a Bayesian statistician who was posed the question "You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’ [...]" would say "left" 80% of the time and "right" 20% of the time. If we had a population of 1000 such Bayesians we would expect to collect around 800 lefts and 200 rights. If we asked the same Bayesian 1000 times we'd expect the same. It's got nothing to do with Bayesian vs. frequentist statistics.
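
As a rough illustration of what "around 800 lefts and 200 rights" looks like, this sketch draws 1000 samples from the 80/20 distribution the prompt describes. The seed is arbitrary and only there so the run is repeatable.

```python
import random
from collections import Counter

rng = random.Random(0)  # arbitrary fixed seed for repeatability

# Draw 1000 answers with the weights the prompt asks for.
draws = rng.choices(["left", "right"], weights=[0.8, 0.2], k=1000)
counts = Counter(draws)

print(counts)
```

The standard deviation of the "left" count is sqrt(1000 * 0.8 * 0.2), about 12.6, so a result a few dozen away from 800 would not be surprising, while 1000 out of 1000 would be.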

Real humans probably would say left more often than 80% of the time, which is what I guess you're getting at, but the question is very clearly asking the subject to "sample from" (an entirely Bayesian activity) a distribution, not to give the expected value. GPT4 gives the expected value and this is simply wrong.

>GPT4 gives the expected value and this is simply wrong.

Only at T=0. See my edit above for how this changes everything.

This doesn't really have anything to do with the language model. The temperature only has to do with the _sampling_ from the probability distribution which the language model predicts. In fact, raising the temperature would eventually cause the model to randomly print "left" or "right," (eventually at 50/50 chance) not converge on the actual distribution which the prompt suggests. I suppose if you restricted the logits to just those tokens "left" and "right", softmaxed them, and then tuned the temperature T you might get it to reproduce the correct distribution, but that would be true of a random language model as well.
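
A small sketch of the temperature argument, assuming the logits have already been restricted to just "left" and "right". The logit values are hypothetical, and `p_left` and `temperature_for` are illustrative helpers rather than part of any sampling library.

```python
import math

def p_left(logit_left, logit_right, temperature):
    """Probability of "left" under a two-token softmax at the given temperature."""
    gap = (logit_left - logit_right) / temperature
    return 1.0 / (1.0 + math.exp(-gap))

def temperature_for(target, logit_left, logit_right):
    """Temperature at which p_left equals target.

    Assumes logit_left > logit_right and 0.5 < target < 1.
    """
    return (logit_left - logit_right) / math.log(target / (1.0 - target))

# Hypothetical logits, strongly peaked on "left".
logit_left, logit_right = 10.0, 0.0

at_one = p_left(logit_left, logit_right, 1.0)    # close to 1
at_huge = p_left(logit_left, logit_right, 1e6)   # close to 0.5

# The one temperature that happens to give 80/20 for these logits.
t_star = temperature_for(0.8, logit_left, logit_right)
tuned = p_left(logit_left, logit_right, t_star)

print(at_one, at_huge, t_star, tuned)
```

The tuned temperature depends only on the logit gap, so any model that merely prefers "left" can be dialed to 80/20 this way, whether or not it took the prompt into account.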

I think it's pretty simple and straightforward: the model simply fails to understand the question and can reasonably be said to not understand probability.

That's just not true. At least not more or less than when performing the same experiment on humans.

This matches my understanding, thanks. I thought I was going crazy reading other comments.

> We know that humans are bad random number generators

This is a good point. LLMs are bad at this, okay, but humans aren't great at it either.

But according to this GPT4 is substantially worse.

Yes, probably. At temperature zero the model will be completely deterministic, so a particular prompt will always produce the same result (ignoring for a second that some fairly common optimisations introduce data races in the GPU).

On the other hand, does it really matter? With a slight tweak to the prompt, ChatGPT generates some serviceable code:

    > Run a function to produce a random number between 1 and 10. What is the number?

    import random

    # Generate a random number between 1 and 10
    random_number = random.randint(1, 10)
    random_number

    The random number generated between 1 and 10 is 9.

> (ignoring for a second that some fairly common optimisations introduce data races in the GPU).

Okay so are any GPU compilers intentionally introducing data races in programs that previously exhibited no data races?

Not really compilers, but the underlying GPU libraries.

Here’s a good jumping off point: https://pytorch.org/docs/stable/generated/torch.use_determin...

> A well behaving LLM responding to the actual question should distribute the logits across "left" and "right" in the way requested by the user and doesn't.

No, a well-behaving LLM would do exactly what's seen. The most likely next token is "left" and it should deterministically output that unless some other layer like a temperature function makes it non-deterministic in its own way (wholly unrelated to the prompt).
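
The difference between deterministic and sampled decoding can be sketched as follows. The 80/20 distribution is a hypothetical model output chosen for illustration, and `decode` is a toy stand-in for a provider's sampling layer, not a real API.

```python
import random

def decode(probs, temperature, rng):
    """Pick one token: argmax at temperature 0, otherwise sample.

    Raising each probability to 1/T and renormalizing is equivalent to
    dividing the underlying logits by T before the softmax.
    """
    if temperature == 0:
        return max(probs, key=probs.get)
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return rng.choices(list(probs), weights=weights, k=1)[0]

rng = random.Random(0)  # arbitrary fixed seed for repeatability

# Hypothetical: a model whose predicted distribution really is 80/20.
probs = {"left": 0.8, "right": 0.2}

greedy = [decode(probs, 0, rng) for _ in range(1000)]
sampled = [decode(probs, 1.0, rng) for _ in range(1000)]

print(greedy.count("left"), sampled.count("left"))
```

Under greedy decoding even an 80/20 prediction yields "left" every time, so the observed output frequencies reflect the predicted distribution only when the sampling layer actually samples from it.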

The fantastical AGI precursor that people have been coached into seeing is what you're talking about, and that's (of course) not what an LLM actually is.

This is essentially just one of the easier ways you can expose the parlor trick behind that misconception.

This simply doesn't follow. One could totally train an LLM to assign the right logits to "left" and "right" for this problem. I suspect it's a problem with the training data.