Hacker News
by thaumasiotes 51 days ago
> The stated problem of mapping raw inputs/scores/logits to a probability distribution can be solved by a bunch of arbitrary functions, and the usual justification given for a softmax is "it has nice derivatives" which is empirically useful but not satisfying.

Often there isn't any more to it than that. For example, the entire justification for least-squares error measurement is that it has convenient derivatives.
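
To make "convenient derivatives" concrete, here is a rough sketch in plain Python. The function names are my own for illustration, not from any library. It checks two well-known facts: the gradient of cross-entropy over a softmax with respect to the logits is just softmax(z) minus the one-hot target, and the gradient of a sum of squared errors is linear in the estimate and vanishes at the sample mean.

```python
import math

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    # Negative log-probability assigned to the target class.
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    # Closed form: softmax(z) minus the one-hot target vector.
    p = softmax(z)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(f, z, eps=1e-6):
    # Central finite differences, used only to check the closed form.
    grad = []
    for i in range(len(z)):
        hi = list(z)
        lo = list(z)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def sse_grad(xs, m):
    # d/dm of sum((x - m)^2) = -2 * sum(x - m): linear in m.
    return -2.0 * sum(x - m for x in xs)

# Illustrative inputs, chosen arbitrarily.
logits = [2.0, -1.0, 0.5]
target = 0
ga = analytic_grad(logits, target)
gn = numeric_grad(lambda z: cross_entropy(z, target), logits)

data = [1.0, 2.0, 6.0]
mean = sum(data) / len(data)
grad_at_mean = sse_grad(data, mean)
```

Here `ga` and `gn` should agree to within finite-difference error, and `grad_at_mean` should be zero, which is the sense in which both losses are easy to differentiate and optimize.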

1 comment

The central limit theorem is an extremely powerful justification for least squares. That doesn't mean the theorem is actually considered every time least squares is used, but the choice absolutely can be strongly justified (to the degree that other error measurements are only needed in relatively small regions of the feature space where the errors have not yet converged to Gaussian).
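
The step from Gaussian errors to least squares can be spelled out as a short derivation. This is a sketch under assumptions that are mine, not stated in the thread: observations $y_i = f_\theta(x_i) + \varepsilon_i$ with independent errors $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ of fixed variance, where $f_\theta$ and $\sigma$ are illustrative symbols.

```latex
\begin{aligned}
\log L(\theta)
  &= \sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right) \right] \\
  &= -\frac{n}{2} \log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2 , \\
\arg\max_\theta \log L(\theta)
  &= \arg\min_\theta \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2 .
\end{aligned}
```

The first term does not depend on $\theta$, so maximizing the likelihood is the same as minimizing the sum of squared errors. The central limit theorem enters as the reason to expect roughly Gaussian errors when each error is the sum of many small independent contributions.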