by knightoffaith 810 days ago
>Non-determinism is an implementation detail, not an intrinsic property, as I understand it (at least as long as you're setting temperature to zero).

Right. A transformer outputs a probability distribution over all possible tokens. The next token is sampled from that distribution and appended to the input sequence, and then the process repeats. Temperature controls the entropy of the distribution: higher temperature means higher entropy, and lower temperature means lower entropy. Technically, zero temperature involves dividing by zero, so under the hood it's simply set to an epsilon so small that the entropy of the distribution is low enough that sampling from it effectively always gives one token, the one with the highest probability. And so at every step of inference, the highest-probability token is emitted.
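
To make that concrete, here's a rough sketch of temperature-scaled sampling in plain Python. The function names, the example logits, and the `eps` value of 1e-6 are all made up for illustration; real inference libraries may handle temperature zero differently (for example, by taking the argmax directly), so treat this as a toy model of the idea rather than how any particular implementation does it.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_token(logits, temperature, rng=random, eps=1e-6):
    """Sample a token index. `eps` is a hypothetical floor that stands in
    for temperature zero so we never divide by zero."""
    t = max(temperature, eps)
    probs = softmax_with_temperature(logits, t)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # illustrative values, not from a real model
for t in (2.0, 1.0, 0.5):
    print(t, entropy(softmax_with_temperature(logits, t)))
print(sample_token(logits, 0.0))
```

The entropy shrinks as the temperature drops, and with the temperature clamped to `eps` the scaled distribution puts essentially all of its mass on the largest logit, so sampling behaves like picking the highest-probability token every time.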