by skierscott 2825 days ago
> a low confidence score

Neural nets should return a low confidence score. But, the popular approach (described below) ignores that. Neural nets ignore confidence because of a technique called softmax [1].

This happens as the final operation of a neural net, and is required for training.

Softmax is a tool to make an array of positive numbers look like a probability distribution:

    out = x / x.sum()
x[i] is a class prediction, but x.sum() != 1. Say the network was uncertain and x[cat, dog] = [0.03, 0.01]. These are small values that do not imply great confidence (the network was trained on vectors with out.sum() = 1). The network would predict “cat” using softmax because out[cat] = 0.75 > 0.25 = out[dog].

But then in inference/prediction, the confidence is ignored. What if x.sum() is small? That would imply that the network is uncertain.

[1]: https://en.m.wikipedia.org/wiki/Softmax_function


No. Regardless of whether the outputs for cat/dog are [0.03, 0.01] or [0.75, 0.25], the network is still three times more confident it's a cat. The uncertainty (entropy) of the normalized outputs is exactly the same in both cases.

In other words, if you only have two object classes, the magnitude of the outputs does not matter, the uncertainty is measured by the relative difference of the outputs.
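
A quick check of the claim above, as a minimal sketch in plain Python. It uses the divide-by-sum normalization from the original post rather than true softmax, and the helper names `normalize` and `entropy` are made up for illustration:

```python
import math

def normalize(values):
    """Divide each value by the total so the result sums to 1."""
    total = sum(values)
    return [v / total for v in values]

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

small = normalize([0.03, 0.01])  # the "uncertain-looking" raw outputs
large = normalize([0.75, 0.25])  # raw outputs that already sum to 1

# Both normalize to roughly [0.75, 0.25], so the entropies match (about 0.811 bits).
print(small, entropy(small))
print(large, entropy(large))
```

Once the outputs are normalized, the scale of the raw values is gone, which is exactly why the two cases are indistinguishable.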

The only way to measure the confidence of the model that the output is "cat OR dog" is to have another class (e.g. "chair"). Only then, looking at all three outputs, can you estimate the confidence of the model regarding "cat OR dog" predictions (vs "NOT (cat OR dog)"). For example, if [cat, dog, chair] outputs are [0.03, 0.01, 0.05] then we know the model is not confident that it's either a cat or a dog, but if the outputs are [0.75, 0.25, 0.05], then it's clear it is.
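
To put numbers on the three-class example, here is a small sketch. It assumes the raw outputs are normalized by dividing by their sum (as in the original post), and `cat_or_dog_share` is a hypothetical helper, not anything from a library:

```python
def normalize(values):
    """Divide each value by the total so the result sums to 1."""
    total = sum(values)
    return [v / total for v in values]

def cat_or_dog_share(outputs):
    """Share of the normalized mass that lands on the first two classes (cat, dog)."""
    cat, dog, _chair = normalize(outputs)
    return cat + dog

weak = cat_or_dog_share([0.03, 0.01, 0.05])    # 0.04 / 0.09, a bit under one half
strong = cat_or_dog_share([0.75, 0.25, 0.05])  # 1.00 / 1.05, above 0.95

print(weak, strong)
```

With the third class present, the first case puts less than half of the mass on "cat OR dog", while the second puts nearly all of it there.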

Is this what softmax is? Simply dividing a vector by the sum of its components? If so, then how does it deserve a name, not to mention a long Wikipedia page full of formulas?

Softmax has two components:

1. Transform the components to e^x. This allows the neural network to work with logarithmic probabilities, instead of ordinary probabilities. This turns the common operation of multiplying probabilities into addition, which is far more natural for the linear algebra based structure of neural networks.

2. Normalize their sum to 1, since that's the total probability we need.
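
The two steps above can be sketched in a few lines of plain Python. Subtracting the maximum before exponentiating is a common numerical-stability trick that I am adding here; it is not part of the description above and does not change the result:

```python
import math

def softmax(logits):
    """Exponentiate each component, then divide by the sum of the exponentials."""
    shift = max(logits)  # assumption: shift by the max to avoid overflow in exp
    exps = [math.exp(x - shift) for x in logits]  # step 1: e^x
    total = sum(exps)
    return [e / total for e in exps]              # step 2: normalize to sum 1

probs = softmax([2.0, 1.0, -1.0])
print(probs, sum(probs))
```

Because only differences between inputs survive the normalization, adding a constant to every input (for example 1000) leaves the output unchanged.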

One important consequence of this is that Bayes' theorem is very natural to such a network, since it's just multiplication of probabilities normalized by the denominator.

The trivial case of a single layer network with softmax activation is equivalent to logistic regression.

The special case of two component softmax is equivalent to sigmoid activation, which is thus popular when there are only two classes. In multi class classification softmax is used if the classes are mutually exclusive and component-wise sigmoid is used if they are independent.
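
Both statements can be checked numerically. This is only a sketch with made-up logits; `softmax` and `sigmoid` are written out by hand here rather than taken from any library:

```python
import math

def softmax(logits):
    """Exponentiate (shifted by the max for stability), then normalize."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

# Two classes: softmax over [a, b] gives the same probability as sigmoid(a - b).
a, b = 1.3, -0.4  # hypothetical logits
two_class = softmax([a, b])[0]
via_sigmoid = sigmoid(a - b)

# Several classes: softmax shares one unit of probability across the classes,
# while component-wise sigmoid scores each class on its own.
logits = [2.0, 2.0, -1.0]  # hypothetical logits
exclusive = softmax(logits)
independent = [sigmoid(z) for z in logits]

print(two_class, via_sigmoid)
print(exclusive, sum(exclusive))      # sums to 1
print(independent, sum(independent))  # sums to more than 1 here
```

With sigmoids, the first two classes can both be above 0.5 at once, which only makes sense if the classes are not mutually exclusive.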

Thanks for the detailed explanation!

It also includes the exponentiation step before the vector normalizations. There are connections to statistical mechanics here, where the relative energy population numbers are proportional to the softmax of the energy levels divided by temperature. (so as temperature goes up, the relative energy differences get smaller and the states are more equally populated.) That idea has been ported over as "softmax temperature" in some places.
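
A rough illustration of the temperature idea, assuming the usual convention of dividing the inputs by a temperature before applying softmax. The function name `softmax_with_temperature` and the example values are my own:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; higher temperature flattens the output."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # shift by the max to avoid overflow in exp
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical inputs
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 10.0)

print(cold)  # sharply peaked on the first entry
print(hot)   # close to uniform
```

As the temperature rises, the gap between the largest and smallest probability shrinks, matching the "states are more equally populated" picture.
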
No, it's not. It's actually e ^ x_i / sum(e ^ x_j for x_j in x), which is in fact different. Simply dividing by the sum wouldn't work for "squashing to a probability distribution" in a large number of cases.
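
One case where plain division by the sum breaks down is a vector with a negative entry. A minimal sketch, with hypothetical helper names and made-up scores:

```python
import math

def divide_by_sum(values):
    """Naive normalization: each value over the total."""
    total = sum(values)
    return [v / total for v in values]

def softmax(values):
    """e^x_i / sum(e^x_j), shifted by the max for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, -1.0]  # hypothetical raw outputs, one of them negative

naive = divide_by_sum(scores)
proper = softmax(scores)

print(naive)   # [2.0, -1.0]: sums to 1 but is not a probability distribution
print(proper)  # both entries strictly between 0 and 1
```

Inputs such as [1.0, -1.0] are worse still for the naive version, since the sum is zero and the division fails outright.
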
So pointwise exponentiation composed with dividing by the sum. Still don't need a new word.

Coming from pure math, I often feel this way now learning statistics and ML. In pure math, it feels like the threshold for how novel a concept should be before it gets its own word is much higher.

E.g., we have "regression" and "classification" instead of "supervised continuous prediction" and "supervised discrete prediction".

If you don't understand where the name "softmax" came from, you don't really understand what it is. Softmax is a differentiable approximation of the max function.

Plot max(0, x) and softmax(0, x) functions, and it should become clear.

Nit: it seems it's more like a smooth approximation to argmax than max.
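
A sketch of the distinction, for anyone who wants to see it numerically. As far as I know, the smooth stand-in for max itself is log-sum-exp, while softmax behaves like a smoothed one-hot indicator of the largest entry. The helpers and values below are my own illustration:

```python
import math

def logsumexp(values):
    """log(sum(e^v)), computed stably; a smooth upper bound on max(values)."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values))

def softmax(values):
    """e^v_i / sum(e^v_j), shifted by the max for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

values = [1.0, 3.0, 2.0]  # hypothetical inputs; the largest is at index 1

# logsumexp lands between max and max + log(n).
print(max(values), logsumexp(values))

# Scaling the inputs up pushes softmax toward a one-hot vector at index 1.
soft = softmax(values)
sharp = softmax([10.0 * v for v in values])
print(soft)
print(sharp)
```

So softmax returns a vector that points at the winner rather than the winning value, which is why the nit above seems fair.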

Yeah, it makes sense that this is a super important function, but I still feel like one could just remember the principle that "exponentiation followed by normalization is a smooth approximation to argmax."

The basic building blocks of most deep learning models are convolutional layers, pooling layers, fully connected layers, and softmax layers. What do you propose we call the "softmax layer" instead?

Normalization layer?

This opens up the possibility of using something other than softmax in there.

Because grad students need to publish papers.

Note that you can get a form of confidence by just not applying softmax to the output during inference. Softmax is primarily to aid in training.

How well do neural networks train with no normalization at all, compared with softmax?

It's not about normalization, it's about the loss function. Softmax is required by cross-entropy minimization (negative log-likelihood, to be precise), which works somewhat better in practice than mean squared error (MSE) minimization (which needs no normalization of outputs).

You need to perform some kind of normalization, since probability must be between 0 and 1 (and being wrong on a confident prediction gives huge penalties using the popular maximum likelihood loss functions).

But you can use component-wise normalization (sigmoid) instead of combined normalization (softmax). These correspond to the assumption that the classes are independent (component-wise sigmoid) or mutually exclusive (softmax).

"probability must be between 0 and 1" - why? (I get it's used in mathematics, but I see no reason why a NN would have to output probability that way.)

"and being wrong on a confident prediction gives huge penalties using the popular maximum likelihood loss functions" - It should.

> I see no reason why a NN would have to output probability that way

For classification tasks, the labels are usually encoded as a one-hot vector (one in the position of the correct class output, zeros everywhere else). If you don't normalize outputs to be between zero and one, it becomes a regression task: you are essentially asking the model to fit your one-hot encoded label. That's not desirable, because we don't care about the actual value of the output for the correct class. Whether it is 0.1, 1.1 or 1001, it is the correct output as long as it's larger than the outputs for the other classes. That's why we want to take the largest output and scale it so that it's always less than one. Its distance from one depends on how much larger it is than the other outputs (the confidence of the model in this prediction).

Without normalization, the model that outputs 1000 for the correct class and tiny values for all other classes would get severely penalized because the label says it should be 1 in that position (so the error is 1000-1=999), even though the model made the correct prediction.
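
The same example in code, as a rough sketch. It assumes a plain sum of squared errors for the regression-style loss and softmax followed by negative log-likelihood for the classification-style loss; both function names are hypothetical:

```python
import math

def squared_error(outputs, target):
    """Sum of squared differences between raw outputs and the one-hot label."""
    return sum((o - t) ** 2 for o, t in zip(outputs, target))

def softmax_cross_entropy(logits, target_index):
    """Negative log of the softmax probability assigned to the correct class."""
    shift = max(logits)  # shift by the max so exp cannot overflow
    log_total = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_total - logits[target_index]

outputs = [1000.0, 0.0, 0.0]  # correct class first, as in the example above
target = [1.0, 0.0, 0.0]      # one-hot label

mse_style = squared_error(outputs, target)
ce_style = softmax_cross_entropy(outputs, 0)

print(mse_style)  # 998001.0, i.e. 999 squared
print(ce_style)   # essentially zero
```

The squared-error loss punishes the correct prediction for overshooting the label, while the softmax-based loss only cares that the correct class dominates.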

There's some confusion about this (e.g. https://news.ycombinator.com/item?id=18054447 ), so hopefully my explanation makes sense.