| HN Mirror



	by TeMPOraL 2825 days ago
	Is this what softmax is? Simply dividing a vector by the sum of its components? If so, then how does it deserve a name, not to mention a long Wikipedia page full of formulas?

5 comments

CodesInChaos 2825 days ago

Softmax has two components:

1. Transform the components to e^x. This allows the neural network to work with logarithmic probabilities, instead of ordinary probabilities. This turns the common operation of multiplying probabilities into addition, which is far more natural for the linear algebra based structure of neural networks.

2. Normalize their sum to 1, since that's the total probability we need.
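
The two steps above can be sketched in a few lines of plain Python. This is only an illustration: the function name `softmax` and the sample input are chosen here, and subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the description above (it does not change the result).

```python
import math

def softmax(xs):
    """Exponentiate each component, then normalize so the results sum to 1."""
    # Assumption: shift by the maximum to avoid overflow in exp();
    # the shift cancels out in the ratio, so the output is unchanged.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # step 1: e^x
    total = sum(exps)
    return [e / total for e in exps]       # step 2: normalize the sum to 1

probs = softmax([1.0, 2.0, 3.0])
shifted = softmax([101.0, 102.0, 103.0])   # same differences, same output
```

Adding a constant to every input corresponds to multiplying every unnormalized probability by the same factor, which the normalization then removes.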

One important consequence of this is that Bayes' theorem is very natural to such a network, since it's just multiplication of probabilities normalized by the denominator.

The trivial case of a single layer network with softmax activation is equivalent to logistic regression.

The special case of two-component softmax is equivalent to sigmoid activation, which is thus popular when there are only two classes. In multi-class classification, softmax is used if the classes are mutually exclusive, and component-wise sigmoid is used if they are independent.
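
A quick numerical check of the two-component case, as a sketch: the helper names `sigmoid` and `softmax2` and the sample values are made up for illustration. The first softmax output equals the sigmoid of the difference between the two inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(a, b):
    """Softmax over exactly two components."""
    ea, eb = math.exp(a), math.exp(b)
    return ea / (ea + eb), eb / (ea + eb)

a, b = 0.3, -1.2            # arbitrary example inputs
p_a, p_b = softmax2(a, b)
# p_a matches sigmoid(a - b) and p_b matches sigmoid(b - a)
```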

TeMPOraL 2824 days ago

Thanks for the detailed explanation!

brilee 2825 days ago

It also includes the exponentiation step before the vector normalization. There are connections to statistical mechanics here, where the relative population of each energy level is proportional to the softmax of the negative energy levels divided by temperature (so as temperature goes up, the relative energy differences matter less and the states are more equally populated). That idea has been ported over as "softmax temperature" in some places.
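
A rough sketch of the "softmax temperature" idea: divide the inputs by a temperature before applying softmax. The function name `softmax_with_temperature` and the sample values are assumptions made for this example, not taken from any particular library.

```python
import math

def softmax_with_temperature(xs, temperature):
    """Softmax of xs / temperature (temperature must be positive)."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 4.0]                     # arbitrary example inputs
cold = softmax_with_temperature(logits, 0.5)   # low temperature: peaked
hot = softmax_with_temperature(logits, 10.0)   # high temperature: nearly uniform
```

At low temperature almost all of the mass sits on the largest input; at high temperature the outputs move toward equal shares.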

Tenobrus 2825 days ago

No, it's not. It's actually e ^ x_i / sum(e ^ x_j for x_j in x), which is in fact different. Simply dividing by the sum wouldn't work for "squashing to a probability distribution" in a large number of cases.
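
One case where plain division by the sum breaks down is a vector with negative components. A minimal sketch, with hypothetical helper names and made-up scores:

```python
import math

def divide_by_sum(xs):
    """Naive normalization: x_i / sum(x)."""
    total = sum(xs)
    return [x / total for x in xs]

def softmax(xs):
    """e^x_i / sum(e^x_j for x_j in x)."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, -1.0, 1.0]
naive = divide_by_sum(scores)   # [1.0, -0.5, 0.5]: sums to 1, but has a negative "probability"
soft = softmax(scores)          # every entry is strictly between 0 and 1
# A vector such as [1.0, -1.0] is worse still: its sum is 0, so the naive version divides by zero.
```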

throwaway080383 2825 days ago

So pointwise exponentiation composed with dividing by the sum. Still don't need a new word.

throwaway080383 2825 days ago

Coming from pure math, I often feel this way now learning statistics and ML. In pure math, it feels like the threshold for how novel a concept should be before it gets its own word is much higher.

E.g., we have "regression" and "classification" instead of "supervised continuous prediction" and "supervised discrete prediction".

p1esk 2824 days ago

If you don't understand where the name "softmax" came from, you don't really understand what it is. Softmax is a differentiable approximation of the max function.

Plot max(0, x) and softmax(0, x) functions, and it should become clear.

throwaway080383 2824 days ago

Nit: it seems it's more like a smooth approximation to argmax than max.

Yeah, it makes sense that this is a super important function, but I still feel like one could just remember the principle that "exponentiation followed by normalization is a smooth approximation to argmax."
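
A small numerical sketch of the distinction being drawn here. When the inputs are scaled up, softmax approaches a one-hot indicator of the largest component (argmax), while the smooth counterpart of max itself is the log-sum-exp function. Log-sum-exp is not mentioned in this thread and is brought in here only for comparison; the scale factor 50 and the inputs are arbitrary.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(xs):
    """log(sum(e^x)), computed with a shift for numerical stability."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

xs = [1.0, 3.0, 2.0]
scale = 50.0
sharp = softmax([scale * x for x in xs])                # close to the one-hot vector [0, 1, 0]
smooth_max = logsumexp([scale * x for x in xs]) / scale # close to max(xs), i.e. 3.0
```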

p1esk 2824 days ago

The basic building blocks of most deep learning models are convolutional layers, pooling layers, fully connected layers, and softmax layers. What do you propose we call the "softmax layer" instead?

TeMPOraL 2824 days ago

Normalization layer?

This opens up the possibility of using something other than softmax in there.

p1esk 2824 days ago

Well, there are other building blocks, such as the batch normalization layer or the local contrast normalization layer (not to mention a dozen batchnorm alternatives, e.g. group normalization, weight normalization, layer normalization, instance normalization, etc.).

If you just say "normalization layer", how am I supposed to know which normalization you're talking about?

hendzen 2825 days ago

Because grad students need to publish papers.
