| It works the same way:
softmax is essentially just applying the normalization to the vector exp(x).
From an "engineering" POV this effectively ensures that the vector you normalize has strictly positive entries, so the result ends up being a proper distribution. From a theory POV you get softmax like distributions (Gibbs distributions) by trying to balance following some energy E(x) and the entropy of the distribution.
In essence the softmax is the answer to "I try to follow the maximum of a function E(x) but I need to maintain some level of uncertainy". The balancing coefficient between entropy and picking the maximum of the function is called "temperature" (following the behavior of particles in a physical system: The colder the system, the lower the chance of having particles randomly walk away from the minimal energy state). specifically, the temperature is softmax(x/temp) if you draw temp->0, your softmax slowly becomes an argmax (with temp=0 being a literal argmax). If you increase the temperature, you are closer to the "random fluctuations" leaving more room for sampling x values that are not the maximum of x. (this is why e.g. LLMs become deterministic as you decrease temp->0) Using a different base other than e implicitly changes the temperature: N^x = exp(ln(N) x) The normalization works the same since you are still dividing a positive value N^x by the sum of all alternatives sum(N^x_i), which is a normalization by design |