by p1esk
2824 days ago
> I see no reason why a NN would have to output probability that way

For classification tasks, the labels are usually encoded as a one-hot vector (one in the position of the correct class, zeros everywhere else). If you don't normalize the outputs to be between zero and one, it becomes a regression task: you are essentially asking the model to fit your one-hot encoded label. That's not desirable, because we don't care about the actual value of the output for the correct class. Whether it is 0.1, 1.1 or 1001, it is the correct output as long as it's larger than the outputs for the other classes.

That's why we want to take the largest output and scale it so that it's always less than one. Its distance from one depends on how much larger it is than the other outputs (the confidence of the model in this prediction). Without normalization, a model that outputs 1000 for the correct class and tiny values for all other classes would get severely penalized, because the label says it should be 1 in that position (so the error is 1000-1=999), even though the model made the correct prediction.

There's some confusion about this (e.g. https://news.ycombinator.com/item?id=18054447), so hopefully my explanation makes sense.
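
To make the 1000-vs-1 example concrete, here is a minimal sketch in plain Python. It assumes softmax as the normalization and cross-entropy as the loss on the normalized outputs, which is the usual setup but is my choice for illustration; the three-class label, the raw output values, and the helper names are all made up.

```python
import math

def softmax(raw_outputs):
    # Subtract the max before exponentiating so large values don't overflow.
    m = max(raw_outputs)
    exps = [math.exp(x - m) for x in raw_outputs]
    total = sum(exps)
    return [e / total for e in exps]

def squared_error(outputs, one_hot):
    # Regression-style loss: fit the one-hot label value by value.
    return sum((o - t) ** 2 for o, t in zip(outputs, one_hot))

def cross_entropy(probs, correct_index):
    # With a one-hot label only the correct-class term survives.
    return -math.log(probs[correct_index])

label = [0, 1, 0]        # hypothetical label: correct class is index 1
correct_index = 1

confident = [0.01, 1000.0, 0.02]   # correct, with a huge raw value
modest = [0.1, 1.1, 0.2]           # correct, with a small margin

# Without normalization the confident (and correct) model is punished.
raw_loss_confident = squared_error(confident, label)
raw_loss_modest = squared_error(modest, label)

# With normalization the same outputs become probabilities.
probs_confident = softmax(confident)
probs_modest = softmax(modest)
ce_confident = cross_entropy(probs_confident, correct_index)
ce_modest = cross_entropy(probs_modest, correct_index)

print("squared error on raw outputs:", raw_loss_confident, raw_loss_modest)
print("softmax probabilities:", probs_confident, probs_modest)
print("cross-entropy after softmax:", ce_confident, ce_modest)
```

On the raw outputs the squared error for the confident model is dominated by the (1000-1)^2 term, so it looks far worse than the modest model even though both pick the right class. After softmax both sets of outputs are between zero and one and sum to one, and the confident model ends up with the lower loss.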