Hacker News
by Glyptodon 49 days ago
What happens if you use an integer like 2 or 3 instead of e in the softmax equation? Is e what makes it so they end up summing to 1? (I have not done real math in yearssss.)
3 comments

It works the same way: softmax is essentially just applying the normalization to the vector exp(x). From an "engineering" POV this effectively ensures that the vector you normalize has strictly positive entries, so the result ends up being a proper distribution.

From a theory POV you get softmax-like distributions (Gibbs distributions) by trying to balance following some energy E(x) against the entropy of the distribution. In essence, softmax is the answer to "I try to follow the maximum of a function E(x) but I need to maintain some level of uncertainty".

The balancing coefficient between entropy and picking the maximum of the function is called "temperature" (following the behavior of particles in a physical system: The colder the system, the lower the chance of having particles randomly walk away from the minimal energy state).

Specifically, the temperature enters by dividing the inputs:

softmax(x/temp)

If you take temp->0, your softmax gradually becomes an argmax (with the temp=0 limit being a literal argmax). If you increase the temperature, you move closer to the "random fluctuations" regime, leaving more room for sampling entries other than the maximum of x. (This is why, e.g., LLM sampling becomes deterministic as you decrease temp->0.)
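A quick sketch of the temperature effect in plain Python, in case it helps to see numbers. The `softmax` helper and the example `logits` are made up for illustration; `temp` must be strictly positive here, since temp=0 only makes sense as a limit.

```python
import math

def softmax(xs, temp=1.0):
    """Softmax of xs / temp. Assumes temp > 0 (illustrative helper)."""
    scaled = [x / temp for x in xs]
    m = max(scaled)  # subtracting the max avoids overflow and does not change the result
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # arbitrary example values
for temp in (100.0, 1.0, 0.1, 0.01):
    print(temp, [round(p, 4) for p in softmax(logits, temp)])
```

At high temp the printed probabilities are close to uniform; at low temp nearly all the mass sits on the largest entry.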

Using a base N other than e implicitly changes the temperature to 1/ln(N):

N^x = exp(ln(N) x)

The normalization works the same, since you are still dividing a positive value N^(x_i) by the sum over all alternatives sum_j N^(x_j), which is a normalization by design.
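Here is a small check of the identity N^x = exp(ln(N) x) applied to softmax, as a sketch. `softmax_base` is a hypothetical helper, not a library function; it normalizes base**(x/temp) directly and skips the overflow protection you would want for large inputs.

```python
import math

def softmax_base(xs, base=math.e, temp=1.0):
    """Normalize base**(x / temp). Assumes base > 0 and temp > 0 (illustrative helper)."""
    powers = [base ** (x / temp) for x in xs]
    total = sum(powers)
    return [p / total for p in powers]

xs = [2.0, 1.0, 0.1]  # arbitrary example values

# Base 2 at temperature 1 ...
base2 = softmax_base(xs, base=2.0)
# ... should match base e at temperature 1/ln(2).
base_e_rescaled = softmax_base(xs, base=math.e, temp=1.0 / math.log(2.0))

print(base2)
print(base_e_rescaled)
```

Both printed vectors should agree up to floating-point rounding, and both sum to 1.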

It's equivalent to multiplying all inputs by log b (the natural log of the base b). Multiplying all inputs by a value changes how extreme the probabilities are. This is easy to see: adding the same value to every input doesn't change the output, so the biggest input can be assumed to be 0 and the others negative. Multiplying by 0 then makes all outputs equal, while as the multiplier tends to infinity, all other inputs tend to -infinity, so the biggest output tends to 1 and the others to 0. Multiplying by a negative number makes the lowest input the highest.
That's equivalent to changing the temperature.
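The cases above can be checked numerically. This is only a sketch: `softmax` and `scaled_softmax` are illustrative helpers, and the inputs and the multiplier 50 (standing in for "tends to infinity") are arbitrary choices.

```python
import math

def softmax(xs):
    """Standard base-e softmax, shifted by the max for numerical safety."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_softmax(xs, k):
    """Softmax after multiplying every input by k (illustrative helper)."""
    return softmax([k * x for x in xs])

xs = [3.0, 1.0, 0.5]  # arbitrary example values

base = softmax(xs)
shifted = softmax([x + 100.0 for x in xs])  # adding a constant: same output
uniform = scaled_softmax(xs, 0.0)           # multiplier 0: all outputs equal
sharp = scaled_softmax(xs, 50.0)            # large multiplier: close to one-hot
flipped = scaled_softmax(xs, -1.0)          # negative multiplier: lowest input wins

for name, p in [("base", base), ("shifted", shifted), ("uniform", uniform),
                ("sharp", sharp), ("flipped", flipped)]:
    print(name, [round(v, 4) for v in p])
```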

Also, so long as the function you use in place of exp is non-negative for all inputs and positive for at least one of them, you'll always get a valid probability distribution.
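To illustrate, a minimal sketch that normalizes with functions other than exp. The `normalize` helper and the two example functions (`relu`, `square`) are chosen here for illustration only; the checks mirror the two conditions stated above.

```python
def normalize(xs, f):
    """Turn f(x_i) into a probability vector. f is any user-supplied function (hypothetical helper)."""
    weights = [f(x) for x in xs]
    if any(w < 0 for w in weights):
        raise ValueError("f must be non-negative on all inputs")
    total = sum(weights)
    if total == 0:
        raise ValueError("f must be positive for at least one input")
    return [w / total for w in weights]

def relu(x):
    return max(x, 0.0)

def square(x):
    return x * x

xs = [2.0, -1.0, 0.5]  # arbitrary example values

p_relu = normalize(xs, relu)      # the negative input gets probability 0
p_square = normalize(xs, square)  # sign is ignored, only magnitude matters

print(p_relu)
print(p_square)
```

Unlike exp, these functions can assign exactly zero probability to some entries, and they do not share softmax's invariance to adding a constant to every input.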