by sigmoid10 784 days ago
>You're saying that instead the author should have taken the logits of "left" and "right", converted them to normalized probabilities and then have expected _those_ to be 80% left and 20% right.

No, that's not what I meant, although it would still make more sense than what the author did. The problem lies in the way you actually determine probabilities. We know that humans are bad random number generators, but they understand the concept well enough to come up with random-looking stuff if you give them the chance. The LLMs here were not even given a chance. In essence, the author is complaining that the LLMs are not behaving according to frequentist statistics when he evaluates them in a strictly Bayesian setting.


I don't agree: a Bayesian statistician posed the question "You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’ [...]" would say "left" 80% of the time and "right" 20% of the time. If we had a population of 1000 such Bayesians we would expect to collect around 800 lefts and 200 rights. If we asked the same Bayesian 1000 times we'd expect the same. It's got nothing to do with Bayesian vs Frequentist statistics.

Real humans probably would say left more often than 80% of the time, which is what I guess you're getting at, but the question is very clearly asking the subject to "sample from" a distribution (an entirely Bayesian activity), not to give the expected value. GPT4 gives the expected value and this is simply wrong.
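To make the distinction concrete, here is a minimal sketch of "sampling from" the stated distribution versus always answering with the single most likely option (what the comment above calls giving the expected value). The names `WEIGHTS`, `sample_answer` and `most_likely_answer` are made up for illustration and have nothing to do with how GPT4 works internally.

```python
import random
from collections import Counter

# Distribution stated in the prompt: about 80% "left", about 20% "right".
WEIGHTS = {"left": 0.8, "right": 0.2}

def sample_answer(rng):
    """Draw one answer according to the stated weights."""
    return rng.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=1)[0]

def most_likely_answer():
    """Always return the single most probable answer."""
    return max(WEIGHTS, key=WEIGHTS.get)

rng = random.Random(0)  # fixed seed so the run is repeatable
sampled = Counter(sample_answer(rng) for _ in range(1000))
greedy = Counter(most_likely_answer() for _ in range(1000))

print("sampled:", dict(sampled))  # roughly 800 left / 200 right
print("greedy: ", dict(greedy))   # "left" every single time
```

The first tally is what the prompt asks for; the second is the behaviour being criticised.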

>GPT4 gives the expected value and this is simply wrong.

Only at T=0. See my edit above for how this changes everything.

This doesn't really have anything to do with the language model. The temperature only has to do with the _sampling_ from the probability distribution which the language model predicts. In fact, raising the temperature would eventually cause the model to randomly print "left" or "right" (eventually at a 50/50 chance), not converge on the actual distribution which the prompt suggests. I suppose if you restricted the logits to just the tokens "left" and "right", softmaxed them, and then tuned the temperature T, you might get it to reproduce the correct distribution, but that would be true of a random language model as well.
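A rough sketch of that last point, assuming the vocabulary is restricted to just two tokens. The logits 5.0 and 1.0 are hypothetical, not taken from any real model, and `two_token_probs` is an illustrative helper rather than anyone's API.

```python
import math

def two_token_probs(logit_left, logit_right, temperature):
    """Softmax over just two logits after dividing by the temperature."""
    a = logit_left / temperature
    b = logit_right / temperature
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

# Hypothetical logits: the model prefers "left" over "right".
LOGIT_LEFT, LOGIT_RIGHT = 5.0, 1.0

# Low T approaches "always left"; high T approaches 50/50.
for t in (0.1, 1.0, 10.0, 1000.0):
    p_left, _ = two_token_probs(LOGIT_LEFT, LOGIT_RIGHT, t)
    print(f"T={t:>7}: P(left)={p_left:.3f}")

# P(left) = 0.8 requires (logit_left - logit_right) / T = ln(0.8 / 0.2) = ln 4.
t_80 = (LOGIT_LEFT - LOGIT_RIGHT) / math.log(4)
p_left_80, _ = two_token_probs(LOGIT_LEFT, LOGIT_RIGHT, t_80)
print(f"T={t_80:.3f}: P(left)={p_left_80:.3f}")
```

The same tuning works for any pair of logits where "left" scores higher than "right", which is why hitting 80/20 this way says little about whether the model understood the prompt.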

I think it's pretty simple and straightforward: the model simply fails to understand the question and can reasonably be said to not understand probability.

That's just not true. At least no more so than when performing the same experiment on humans.

This matches my understanding, thanks. I thought I was going crazy reading other comments.

> We know that humans are bad random number generators

This is a good point. LLMs are bad at this, okay, but humans aren't great at it either.

But according to this, GPT4 is substantially worse.

Yes, probably. At temperature zero the model will be completely deterministic, so a particular prompt will always produce the same result (ignoring for a second that some fairly common optimisations introduce data races in the GPU).

On the other hand, does it really matter? With a slight tweak to the prompt, ChatGPT generates some serviceable code:

    > Run a function to produce a random number between 1 and 10. What is the number?

    import random

    # Generate a random number between 1 and 10
    random_number = random.randint(1, 10)
    random_number

    The random number generated between 1 and 10 is 9.

> (ignoring for a second that some fairly common optimisations introduce data races in the GPU).

Okay, so are any GPU compilers intentionally introducing data races in programs that previously exhibited no data races?

Not really compilers, but the underlying GPU libraries.

Here’s a good jumping-off point: https://pytorch.org/docs/stable/generated/torch.use_determin...
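For intuition on why accumulation order matters at all, here is a CPU-only sketch. It is not GPU code; it only shows that floating-point addition is not associative, so a reduction that sums the same values in a different order can return a slightly different result. My assumption is that this is the kind of effect meant here, with the order varying from run to run on a GPU.

```python
import random

# Floating-point addition is not associative, so grouping matters.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b, a, b)  # the two results differ in the last digit

def total(values):
    """Sum strictly left to right, like a sequential reduction."""
    acc = 0.0
    for v in values:
        acc += v
    return acc

rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.uniform(-1e6, 1e6) for _ in range(10_000)]

# Same numbers, different order, standing in for a reduction whose
# accumulation order is not fixed.
shuffled = values[:]
rng.shuffle(shuffled)

print(total(values) - total(shuffled))  # typically tiny, not guaranteed to be zero
```

A discrepancy that small can still flip a near-tie between two tokens, which would explain non-identical outputs even at temperature zero.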