|
I asked that early in my career. We want a metric essentially because if we converge, or have a good approximation, in the metric, then we are close in some important respects. Squared error, then, gives one such metric. But for given data there are usually several metrics we might use, e.g., absolute error (L^1), worst-case error (L^infinity), L^p for positive integer p, etc.
From 50,000 feet up, the reason for using squared error is that we get to have the Pythagorean theorem and, more generally, get to work in a Hilbert space, a relatively nice place to be. E.g., we also get to work with angles from inner products, correlations, and covariances -- we get cosines and a version of the law of cosines. We also get to do orthogonal projections, which give us minimum squared error. In a Hilbert space we can commonly write the total error as a sum of contributions from orthogonal components, that is, decompose the error into contributions from those components -- nice.
The Hilbert space we get from squared error gives us the nicest version of Fourier theory, that is, orthogonal representation and decomposition, and best squared error approximation. We also like Fourier theory with squared error because of how it gives us the Heisenberg uncertainty principle.
Under meager assumptions, for real valued random variables X and Y, E[Y|X], a function of X, is the best squared error approximation of Y by a function of X. Squared error gives us variance, and in statistics the sample mean and sample variance are sufficient statistics for the Gaussian; that is, for Gaussian data, we can take the sample mean and sample variance, throw away the rest of the data, and do just as well.
For more, convergence in squared error implies convergence almost surely, at least for a subsequence. Then there is the Hilbert space result that every nonempty, closed, convex subset has a unique element of minimum norm (from squared error) -- nice. |
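The orthogonal projection and Pythagorean points above can be checked numerically. What follows is only a minimal sketch in R^3 with the ordinary dot product; the helper names `dot`, `project_onto`, and `sq_err` are made up for illustration and do not come from any library.

```python
# Sketch: orthogonal projection in R^3 with the usual dot product.
# Helper names are illustrative assumptions, not a library API.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_onto(y, x):
    """Orthogonal projection of y onto the line spanned by a nonzero x."""
    c = dot(y, x) / dot(x, x)
    return [c * a for a in x]

y = [3.0, 1.0, 2.0]
x = [1.0, 1.0, 0.0]

p = project_onto(y, x)               # candidate best approximation of y in span{x}
r = [a - b for a, b in zip(y, p)]    # residual (the "error" component)

# The residual is orthogonal to x ...
orthogonality = dot(r, x)

# ... so the squared norm splits as in the Pythagorean theorem:
# |y|^2 = |p|^2 + |r|^2
lhs = dot(y, y)
rhs = dot(p, p) + dot(r, r)

def sq_err(c):
    """Squared error of approximating y by c * x."""
    return sum((a - c * b) ** 2 for a, b in zip(y, x))

# The projection coefficient; nearby coefficients give larger squared error.
best_c = dot(y, x) / dot(x, x)
print(orthogonality, lhs, rhs, sq_err(best_c), sq_err(best_c + 0.1))
```

The same decomposition is what lets total squared error be written as a sum over orthogonal components; with absolute or worst-case error there is no inner product, so no such split is available.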
Many nice properties of the square loss (in fact un-fucking-believably nice properties) stem not from the fact that its square root is a metric but from the fact that it is a Bregman divergence. Another oft-used 'divergence' in this class is KL divergence (which, as a loss, differs from cross-entropy only by a term that does not depend on the model).
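For anyone who has not seen the definition: a Bregman divergence is built from a differentiable convex function phi as D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>. Here is a minimal sketch, with function names that are purely illustrative, showing that the squared norm as phi recovers squared Euclidean distance and that negative entropy as phi recovers KL divergence (for vectors that each sum to 1).

```python
import math

def bregman(phi, grad_phi, p, q):
    """D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    linear_term = sum(g * (a - b) for g, a, b in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - linear_term

# Generator 1: squared Euclidean norm.
def sq_norm(x):
    return sum(a * a for a in x)

def sq_norm_grad(x):
    return [2 * a for a in x]

# Generator 2: negative Shannon entropy (entries must be positive).
def neg_entropy(x):
    return sum(a * math.log(a) for a in x)

def neg_entropy_grad(x):
    return [math.log(a) + 1 for a in x]

# Two probability vectors, chosen arbitrarily for the example.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

d_sq = bregman(sq_norm, sq_norm_grad, p, q)
sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))

d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
kl = sum(a * math.log(a / b) for a, b in zip(p, q))

# Unlike a metric, a Bregman divergence need not be symmetric.
d_kl_reversed = bregman(neg_entropy, neg_entropy_grad, q, p)

print(d_sq, sq_dist)   # these two agree
print(d_kl, kl)        # and so do these
print(d_kl_reversed)   # differs from d_kl
```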
Bregman introduced this class purely as machinery for solving convex optimization problems. His motivation was to generalize the method of alternating projections to spaces other than a Hilbert space. But it turned out that Bregman divergences are intimately connected with the exponential family of distributions, also called the Pitman, Darmois, Koopman class of distributions. If one is caught unprepared, it takes some racking of the brain to come up with a parametric family that does not belong to this class; almost all parametric families used in stats (barring a few) belong to it.
One may again ask why this class is so popular in probability and statistics. The answer is again convenience: they are almost as easy as Gaussians to work with, they have well behaved sufficient statistics, and their stochastic completion gives you the entire space of 'regular' enough distributions with finite dimensional parameterizations.
You mentioned conditional expectation. So one may ask which loss functions are minimized by conditional expectation. Bregman divergences are that entire class. Of course square loss satisfies it too (more importantly, the L2 metric on its own does not; it is the act of squaring it that does this).
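A quick numerical illustration of the point about the mean, in the simplest unconditional case: for a small sample, the constant c minimizing the average Bregman loss D(y, c) is the sample mean for several different generators, while the unsquared error |y - c| is minimized by the median instead. This is just a grid-search sketch; the data, the grid, and all names are assumptions picked for the example.

```python
import math

# Arbitrary positive sample, skewed so that mean and median differ.
ys = [1.0, 2.0, 2.0, 3.0, 10.0]
mean = sum(ys) / len(ys)

# Three Bregman divergences D(y, c) in one dimension.
def squared(y, c):            # generator phi(x) = x^2
    return (y - c) ** 2

def generalized_kl(y, c):     # generator phi(x) = x log x
    return y * math.log(y / c) - y + c

def itakura_saito(y, c):      # generator phi(x) = -log x
    return y / c - math.log(y / c) - 1

# Not a Bregman divergence: the unsquared distance.
def absolute(y, c):
    return abs(y - c)

def average_loss(loss, c):
    return sum(loss(y, c) for y in ys) / len(ys)

# Candidate constants 0.50, 0.51, ..., 12.00.
grid = [k / 100 for k in range(50, 1201)]

def grid_minimizer(loss):
    return min(grid, key=lambda c: average_loss(loss, c))

bregman_minimizers = {
    loss.__name__: grid_minimizer(loss)
    for loss in (squared, generalized_kl, itakura_saito)
}
abs_minimizer = grid_minimizer(absolute)

print(mean, bregman_minimizers)  # every Bregman minimizer lands on the mean
print(abs_minimizer)             # the absolute-error minimizer is the median
```

Replacing the sample average by a conditional average given X gives the E[Y|X] version of the same statement.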
Very interesting stuff (at least to me)