"Temperature" doesn't make sense unless your model is predicting a distribution. You can't "temperature sample" a calculator, for instance. The output of the LLM is a predictive distribution over the next token; this is the formulation you will see in every paper on LLMs. It's true that you can do various things with that distribution other than sampling it: you can compute its entropy, you can find its mode (argmax), etc., but the type signature of the LLM itself is `prompt -> probability distribution over next tokens`.
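To make that type signature concrete, here's a minimal sketch of what temperature does to a next-token distribution. The logits and the three-token vocabulary are made up for illustration, and the helper names are my own, not from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into a probability distribution, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for the greedy limit")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def argmax(values):
    """Index of the largest value, i.e. the mode of the distribution."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], "entropy:", round(entropy(probs), 3))
```

Lower temperature sharpens the distribution toward its mode and higher temperature flattens it, but the mode itself doesn't move. None of this is meaningful unless the thing being scaled is a distribution in the first place.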
Any interpretation (including interpreting the inputs to the neural net as a "prompt") is "slapped on" in some sense: at some level, it's all just numbers being added, multiplied, and so on.
But I wouldn't call the probabilistic interpretation "after the fact." The entire training procedure that generated the LM weights (the pre-training as well as the RLHF post-training) is formulated based on the understanding that the LM predicts p(x_t | x_1, ..., x_{t-1}). For example, pretraining maximizes the log probability of the training data, and RLHF typically maximizes an objective that combines "expected reward [under the LLM's output probability distribution]" with "KL divergence between the pretraining distribution and the RLHF'd distribution" (a probabilistic quantity).
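Written out, the two objectives look roughly like this. This is the common textbook form, not necessarily what any given lab uses; the symbols r (reward model), β (KL weight), p_ref (the pretrained reference model) and y (a sampled completion for prompt x) are my notation.

```latex
% Pretraining: maximize the log probability of the training data
\max_\theta \; \sum_{t} \log p_\theta(x_t \mid x_1, \ldots, x_{t-1})

% RLHF (typical form): expected reward minus a KL penalty toward the pretrained model
\max_\theta \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta \, \mathrm{KL}\big( p_\theta(\cdot \mid x) \,\|\, p_{\mathrm{ref}}(\cdot \mid x) \big)
```

Both the expectation and the KL term are only defined if p_θ is a probability distribution, which is the point.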
The output distribution is deterministic; the output token is sampled from that distribution and is therefore not deterministic.
Temperature modulates the output distribution, but setting it to 0 (i.e. argmax sampling) is not the norm.
Running temperature of zero/greedy sampling (what you call "argmax sampling") is EXTREMELY common.
LLMs are basically "deterministic" when using greedy sampling, except for MoE-related shenanigans (what historically prevented determinism in ChatGPT) or floating-point issues on the GPU. In practice, LLMs are in fact basically "deterministic" except for the sampling/temperature stuff that we add at the very end.
There's extra randomness added accidentally in practice: inference is a massively parallelized set of matrix multiplications, and floating point math is not associative. The randomness in execution order gets converted into a random FP error, so even setting temperature to 0 doesn't guarantee repeatable results.
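For anyone who hasn't seen this bite, here's a small sketch in plain Python (no GPU needed). Floating point addition is commutative but not associative, so the same numbers summed with a different grouping can give a different result. The specific values are just convenient examples.

```python
# Same three numbers, two different groupings.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left, right, left == right)

# Swapping operands (commutativity) is fine; regrouping is what hurts.
print(a + b == b + a)

# A more dramatic case: the small term is lost entirely in one grouping.
big, small = 1e16, 1.0
kept = (big + -big) + small   # cancellation happens first, small survives
lost = big + (-big + small)   # small is rounded away next to the big value
print(kept, lost)
```

A parallel reduction on a GPU effectively picks the grouping at run time, so tiny differences like these can flip which logit is largest when two tokens are nearly tied.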
This sort of nondeterministic scheduling of non-associative floating point ops is essentially running at the level of GPU firmware, so I would imagine that in this case Nvidia is aware.
Yes, you can sample deterministically, but that's some combination of computationally intractable and only useful on a small subset of problems. The black box outputting a non-deterministic token is a close enough approximation for most people.
"The important thing to remember is that the output token of the LLM (black box) is not deterministic. Rather, it is a probability distribution over all the available tokens in the vocabulary."
He is saying that there is non-determinism in the output of the LLM (i.e. in these probability distributions), when in fact the randomness only comes from choosing to use a random number generator to sample from this output.
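A toy sketch of that distinction, with a made-up vocabulary and a fixed distribution standing in for the model's output. The function names here are hypothetical, just for illustration.

```python
import random

vocab = ["cat", "dog", "fish"]  # hypothetical toy vocabulary
probs = [0.6, 0.3, 0.1]         # stand-in for the model's (deterministic) output distribution

def sample_tokens(seed, n=10):
    """Sample n tokens from the fixed distribution using a seeded RNG."""
    rng = random.Random(seed)
    return rng.choices(vocab, weights=probs, k=n)

def greedy_token():
    """Pick the mode of the distribution; no RNG involved."""
    return vocab[max(range(len(probs)), key=lambda i: probs[i])]

print(sample_tokens(seed=0))
print(sample_tokens(seed=0))  # same seed, same draws
print(sample_tokens(seed=1))  # different seed, draws may differ
print(greedy_token())
```

The distribution never changes between calls. All the variation comes from the random number generator you chose to plug in, and fixing the seed removes it.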
The author is saying that the output token is not deterministic. I don't think they said the distribution was stochastic.
Even so, the distribution of the second token output by the model would be stochastic (unless you condition on the first token). So in that sense there may also be a stochastic probability distribution.
Mostly unrelated (I agree with you, and I wrote an ancestor comment you're responding to with the same line of thinking): I have built a couple of LLMs where the distribution itself is stochastic. That's not key to how they work as a black box, but much like how quicksort has certain performance characteristics, I did find it advantageous to introduce randomness into the model itself.
You could still easily model the next token as a conditional probability distribution though if you wanted; the computation of entropy just might be a bit spendier.
Author here: Yes. You are right. I was meaning to paint a picture that instead of the next token appearing magically, it is sampled from a probability distribution. The notion of determinism could be explained differently. Thanks for pointing it out!
I think you're confusing training and inference. During training there are things like initialization, data shuffling and dropout that depend on random numbers. At inference time these don't apply.