Do LLMs always pick the most probable next word? I would have thought this would lead to the same output every time for a given input. How does that square with the randomness you get from prompting the same thing over and over?
The most typical reason argmax (temp 0) is non-deterministic is that your request is batched together with other people's requests. The number and size of those requests affect the matrix sizes and thus the tiling decisions. That changes the order of the floating-point operations, and thus the results.
Nvidia gives some guarantees about deterministic results from their kernels, but those only apply when you have exactly the same input data, which is not the case with in-flight batching.
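To illustrate the ordering point, here is a minimal Python sketch. It does not model GPU tiling at all; it only shows that floating-point addition is not associative, so the same numbers accumulated in a different order can give a different total, and a near-tie between two logits can then flip the argmax. The logit values and the helper names (`accumulate`, `argmax`) are made up for the example.

```python
def accumulate(values):
    """Add values strictly left to right, like a simple reduction loop."""
    total = 0.0
    for v in values:
        total += v
    return total

def argmax(xs):
    """Index of the largest value; ties go to the lowest index."""
    best = 0
    for i in range(1, len(xs)):
        if xs[i] > xs[best]:
            best = i
    return best

# Hypothetical partial products that make up one logit.
parts = [0.1, 0.2, 0.3]

forward = accumulate(parts)          # ((0.1 + 0.2) + 0.3)
backward = accumulate(parts[::-1])   # ((0.3 + 0.2) + 0.1)

print(forward == backward)  # the two orders disagree in the last bit
print(forward, backward)

# A competing token whose logit happens to sit right at the boundary.
rival = 0.6
print(argmax([rival, forward]))   # accumulation order 1
print(argmax([rival, backward]))  # accumulation order 2
```

In a real model the differences are just as tiny, but with tens of thousands of candidate tokens and long generations, one flipped token is enough to send the rest of the output down a different path.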
It depends. If we use beam search, we look for the most likely sequence of tokens rather than picking the most likely token at each step. This process is still deterministic, though.
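A toy sketch of the difference, in Python. The `MODEL` table is a made-up stand-in for a language model (it maps the tokens generated so far to next-token probabilities), and `greedy` and `beam_search` are hypothetical helper names, not any library's API. Greedy decoding commits to "A" because it is the most likely first token, while a beam of width 2 keeps "B" around long enough to discover that "B x" is the more likely sequence overall.

```python
# Hypothetical next-token probabilities, keyed by the prefix generated so far.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy(steps):
    """Pick the single most likely token at every step."""
    seq, prob = (), 1.0
    for _ in range(steps):
        token, p = max(MODEL[seq].items(), key=lambda kv: kv[1])
        seq, prob = seq + (token,), prob * p
    return seq, prob

def beam_search(steps, width):
    """Keep the `width` most likely partial sequences at every step."""
    beams = [((), 1.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), prob * p)
            for seq, prob in beams
            for tok, p in MODEL[seq].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]
    return beams[0]

greedy_seq, greedy_prob = greedy(steps=2)
beam_seq, beam_prob = beam_search(steps=2, width=2)
print(greedy_seq, greedy_prob)
print(beam_seq, beam_prob)
```

Neither function uses any randomness, so running either one twice on the same table gives the same answer.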
We can also sample from the distribution, which introduces randomness. Basically, if word1 should be chosen 75% of the time and word2 25% of the time, it will do that.
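To make the 75/25 example concrete, here is a minimal Python sketch using the standard library's `random.choices`. The two-word distribution comes from the comment above; the function names and the fixed seed are just choices made for the illustration.

```python
import random

def sample_next(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def greedy_next(probs):
    """Always return the single most probable word (argmax)."""
    return max(probs, key=probs.get)

probs = {"word1": 0.75, "word2": 0.25}

rng = random.Random(0)  # seeded only so the sketch is repeatable
draws = [sample_next(probs, rng) for _ in range(10_000)]
share_word1 = draws.count("word1") / len(draws)

print(greedy_next(probs))  # argmax never picks word2
print(share_word1)         # sampling picks word1 roughly 75% of the time
```

With sampling, two runs of the same prompt only match if the random number generator is in the same state, which is why repeated prompts usually give different completions.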
The randomness you’re seeing can also be due to implementation details.
Setting the temperature to 0 doesn't get you perfectly deterministic output though, per https://medium.com/google-cloud/is-a-zero-temperature-determ... as you don't have perfect control over which approximations are made in your floating-point operations.