by WhitneyLand 529 days ago
> the output token of the LLM (black box) is not deterministic. Rather, it is a probability distribution over all the available tokens

How is this not deterministic? Randomness is intentionally added via temperature.


"Temperature" doesn't make sense unless your model is predicting a distribution. You can't "temperature sample" a calculator, for instance. The output of the LLM is a predictive distribution over the next token; this is the formulation you will see in every paper on LLMs. It's true that you can do various things with that distribution other than sampling it: you can compute its entropy, you can find its mode (argmax), etc., but the type signature of the LLM itself is `prompt -> probability distribution over next tokens`.
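To make that type signature concrete, here is a minimal sketch. It assumes a hypothetical toy function `next_token_distribution` standing in for the model, with a made-up four-token vocabulary and made-up logits; none of it comes from a real LLM. The point is only that the "model" returns a distribution, and entropy, mode, and sampling are separate things you do with it.

```python
import math
import random

# Hypothetical 4-token vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "mat"]

def next_token_distribution(prompt):
    """Toy stand-in for an LLM: prompt -> probability distribution over VOCAB."""
    # Made-up logits. A real model would compute these from the prompt;
    # this toy ignores the prompt entirely.
    logits = [2.0, 1.0, 0.5, -1.0]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats. No randomness involved."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mode(probs):
    """Most likely token (argmax). No randomness involved."""
    return VOCAB[max(range(len(probs)), key=probs.__getitem__)]

def sample(probs, rng):
    """Draw one token. This is the only step that uses a random number generator."""
    return rng.choices(VOCAB, weights=probs, k=1)[0]

dist = next_token_distribution("the cat")
print(dist)                           # same list every time
print(entropy(dist))                  # same number every time
print(mode(dist))                     # same token every time
print(sample(dist, random.Random()))  # may differ between runs
```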
The temperature in LLMs is a parameter of a regularization step that determines how neuron activation levels get mapped to odds ratios.

Zero temperature => fully deterministic

The neuron activation levels do not inherently form or represent a probability distribution. That's something we've slapped on after the fact.
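For reference, the temperature mapping being discussed is usually written as a softmax over logits divided by the temperature. The sketch below assumes that standard formulation and uses made-up logits; `softmax_with_temperature` is a name chosen here for illustration. Temperature exactly 0 is not defined by the formula (division by zero), so implementations typically treat it as argmax; the sketch shows the distribution collapsing toward argmax as the temperature gets small.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Map raw activations (logits) to probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up values for illustration

hot = softmax_with_temperature(logits, 2.0)      # flatter distribution
normal = softmax_with_temperature(logits, 1.0)   # plain softmax
cold = softmax_with_temperature(logits, 0.01)    # nearly all mass on the argmax

for name, probs in (("T=2.0", hot), ("T=1.0", normal), ("T=0.01", cold)):
    print(name, [round(p, 4) for p in probs])
```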

Any interpretation (including interpreting the inputs to the neural net as a "prompt") is "slapped on" in some sense—at some level, it's all just numbers being added, multiplied, and so on.

But I wouldn't call the probabilistic interpretation "after the fact." The entire training procedure that generated the LM weights (the pre-training as well as the RLHF post-training) is formulated based on the understanding that the LM predicts p(x_t | x_1, ..., x_{t-1}). For example, pretraining maximizes the log probability of the training data, and RLHF typically maximizes an objective that combines "expected reward [under the LLM's output probability distribution]" with "KL divergence between the pretraining distribution and the RLHF'd distribution" (a probabilistic quantity).
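Written out, the two objectives mentioned above look roughly like this. The notation is an assumption made here for illustration: $\theta$ for the model weights, $r$ for the reward, $\beta$ for the KL coefficient, $\pi_\theta$ for the model being tuned and $\pi_{\text{ref}}$ for the pretrained reference model. Exact formulations vary between papers.

```latex
% Pretraining: minimize the negative log-likelihood of the training tokens
\mathcal{L}_{\text{pretrain}}(\theta)
  = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \ldots, x_{t-1})

% RLHF (typical form): maximize expected reward minus a KL penalty
J_{\text{RLHF}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)
```

Both expressions only make sense if the model's output is read as a probability distribution.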

Under a crossentropy loss the output activations do absolutely represent a probability distribution, since that is what we're modeling.
The output distribution is deterministic; the output token is sampled from that distribution and is therefore not deterministic. Temperature modulates the output distribution, but setting it to 0 (i.e. argmax sampling) is not the norm.
Running temperature of zero/greedy sampling (what you call "argmax sampling") is EXTREMELY common.

LLMs are basically "deterministic" when using greedy sampling except for either MoE related shenanigans (what historically prevented determinism in ChatGPT) or due to floating point related issues (GPU related). In practice, LLMs are in fact basically "deterministic" except for the sampling/temperature stuff that we add at the very end.

> except for either MoE related shenanigans (what historically prevented determinism in ChatGPT)

The original ChatGPT was based on GPT-3.5, which did not use MoE.

There's extra randomness added accidentally in practice: inference is a massively parallelized set of matrix multiplications, and floating point math is not associative, so the randomness in execution order gets converted into a random FP error. Even setting temperature to 0 doesn't guarantee repeatable results.
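A small illustration of the floating-point effect, in plain Python on a CPU rather than on a GPU: adding the same numbers in a different order can give a different result. The helper `sum_in_order` is a name made up for this sketch; it uses an explicit loop because the built-in `sum` may apply compensated summation in newer Python versions and hide the effect.

```python
def sum_in_order(values):
    """Naive left-to-right floating-point sum, one addition at a time."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three numbers, two groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Same three numbers, two orders. The 1.0 is lost when added to 1e16 first.
order_a = sum_in_order([1e16, 1.0, -1e16])
order_b = sum_in_order([1e16, -1e16, 1.0])
print(order_a, order_b)  # 0.0 1.0
```

In a parallel reduction the order of additions can vary from run to run, so small differences like these can end up in the logits and occasionally flip an argmax.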
Only if the inference software doesn't guarantee concurrency, which is CS 101
This sort of nondeterministic scheduling of non associative floating point ops is essentially running at the level of GPU firmware, so, I would imagine that in this case, Nvidia is aware.
The output "token"

Yes, you can sample deterministically, but that's some combination of computationally intractable and only useful on a small subset of problems. The black box outputting a non-deterministic token is a close enough approximation for most people.

The author of the article seems confused, saying:

"The important thing to remember is that the output token of the LLM (black box) is not deterministic. Rather, it is a probability distribution over all the available tokens in the vocabulary."

He is saying that there is non-determinism in the output of the LLM (i.e. in these probability distributions), when in fact the randomness only comes from choosing to use a random number generator to sample from this output.

The author is saying that the output token is not deterministic. I don't think they said the distribution was stochastic.

Even so the distribution of the second token output by the model would be stochastic (unless you condition on the first token). So in that sense there may also be a stochastic probability distribution.

Mostly unrelated (I agree with you, and I wrote an ancestor comment in this thread with the same line of thinking): I have built a couple of LLMs where the distribution itself is stochastic. That's not key to how they work as a black box, but much like how quicksort has certain performance characteristics, I did find it advantageous to introduce randomness into the model itself.

You could still easily model the next token as a conditional probability distribution though if you wanted; the computation of entropy just might be a bit spendier.

Author here: Yes. You are right. I was meaning to paint a picture that instead of the next token appearing magically, it is sampled from a probability distribution. The notion of determinism could be explained differently. Thanks for pointing it out!
Entropy is also added via a random seed. The model is only deterministic if you use the same random seed.
I think you're confusing training and inference. During training there are things like initialization, data shuffling and dropout that depend on random numbers. At inference time these don't apply.
Decoding (sampling) uses (pseudo) random numbers. Otherwise the same prompt would always give the same response.

Computing entropy generally does not.

See e.g. https://huggingface.co/blog/how-to-generate
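A minimal sketch of the seed point, using only the Python standard library and a made-up three-token distribution (this is not the Hugging Face API): sampling with the same seed reproduces the same tokens, while greedy decoding needs no random numbers at all. `greedy` and `sample_tokens` are names invented for this example.

```python
import random

VOCAB = ["yes", "no", "maybe"]  # hypothetical vocabulary
PROBS = [0.5, 0.3, 0.2]         # made-up next-token distribution

def greedy(probs):
    """Pick the most likely token; no random number generator involved."""
    return VOCAB[probs.index(max(probs))]

def sample_tokens(probs, seed, n=10):
    """Draw n tokens from the distribution using a seeded pseudo-random generator."""
    rng = random.Random(seed)
    return rng.choices(VOCAB, weights=probs, k=n)

run_a = sample_tokens(PROBS, seed=0)
run_b = sample_tokens(PROBS, seed=0)
print(run_a == run_b)                 # True: same seed, same draws
print(sample_tokens(PROBS, seed=1))   # a different seed may give different draws
print(greedy(PROBS))                  # always "yes"
```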

Sure - but that's not the output of the model itself, that's the process of (typically) randomly sampling from the output of the model.
Right, sampling from a model, also known as *inference* (for LLMs).

The inference here is perhaps less pure than what you refer to but you're talking to human beings; there's no need for heavy pedantry.