Why Mean Squared Error?

	Why Mean Squared Error? (danijar.com)
	37 points by danijar 3053 days ago

3 comments

tofof 3053 days ago

I think https://www.benkuhn.net/squared offers a much more comprehensive view of the advantages of squared error in about the same length.

Edit: turns out that article was actually featured on HN previously! See https://news.ycombinator.com/item?id=13032210 for the thorough accompanying discussion.

get 3053 days ago

TLDR:

Because it gives more weight to one big error than to multiple small ones with the same sum.

We want the errors to be noise and not systematic. Noise usually has a gaussian distribution. And in a gaussian distribution multiple small values are more likely than one big one.
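To make the Gaussian argument concrete, here is a short derivation sketch. It assumes the errors are independent, zero-mean Gaussian with a fixed variance; the symbols $y_i$ (observed value), $\hat{y}_i$ (prediction), $\sigma$ and $n$ are introduced here for illustration and are not from the comment above. Under those assumptions, the negative log-likelihood of the data is

```latex
-\log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right)
  = \frac{n}{2}\log\!\left(2\pi\sigma^2\right)
  + \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

The first term does not depend on the predictions, so under these assumptions the predictions that maximize the likelihood are exactly the ones that minimize the sum of squared errors.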

get 3053 days ago

An example:

Imagine these two predictors:

    Reality: 1 1 1 1 1 9 1 1 1 1
    Predic1: 2 2 2 2 2 2 2 2 2 2
    Predic2: 3 3 3 3 3 6 3 3 3 3

    SumOfErrors(Predic1) is 16
    SumOfErrors(Predic2) is 21

So was Predic1 better than Predic2? No, because coming close to the one outlier shows more predictive power than staying close to the average. Therefore we use SumOfSquerrors:

    SumOfSquerrors(Predic1) is 58
    SumOfSquerrors(Predic2) is 45

This shows that Predic2 is "better" and we are happy :)
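The numbers above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes that SumOfErrors means the sum of absolute differences, which is the reading that matches the totals given.

```python
# Data from the example above.
reality = [1, 1, 1, 1, 1, 9, 1, 1, 1, 1]
predic1 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
predic2 = [3, 3, 3, 3, 3, 6, 3, 3, 3, 3]


def sum_of_errors(actual, predicted):
    """Sum of absolute errors (assumed meaning of SumOfErrors)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def sum_of_squerrors(actual, predicted):
    """Sum of squared errors (SumOfSquerrors in the comment)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


print(sum_of_errors(reality, predic1), sum_of_errors(reality, predic2))
print(sum_of_squerrors(reality, predic1), sum_of_squerrors(reality, predic2))
```

The absolute sums favor Predic1 and the squared sums favor Predic2, because squaring turns Predic1's single miss of 7 into 49 while each of Predic2's misses of 2 only becomes 4.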

scrooched_moose 3053 days ago

It should be 16 & 21 and 58 & 45.

get 3053 days ago

True. Fixed. Thanks.

ralusek 3052 days ago

But what I've never understood is that if your objective is to magnify errors, why not cube it? Why not to a greater power still? If the other benefit is that all negative values to an even power become positive, then why not take the absolute value of the cube? No matter what, the degree to which we magnify errors strikes me as arbitrary.

srean 3053 days ago

> Noise usually has a gaussian distribution

This belief is often a good indicator that a data scientist is divorced from reality.

Someone 3053 days ago

The intuitively more logical “average absolute error” doesn’t necessarily have a unique solution. For example: if your samples are x1 and x2, any estimator between x1 and x2 has minimum average absolute error.

Squared error doesn’t have that problem (the midpoint between x1 and x2 uniquely minimizes it) and (very important historically) is easy to compute for the linear regression case. That’s why linear regression and squared error won. The rest, I think, is gravy. If absolute error were easy to minimize, we might even have found/invented some other properties that ‘show’ why that is nice.
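A small numerical sketch of the two-sample case, in Python. The sample values 2 and 6, the candidate list, and the function names are arbitrary choices for illustration, not from the comment.

```python
def mean_abs_error(samples, c):
    """Average absolute error of the constant estimate c."""
    return sum(abs(x - c) for x in samples) / len(samples)


def mean_sq_error(samples, c):
    """Average squared error of the constant estimate c."""
    return sum((x - c) ** 2 for x in samples) / len(samples)


# Hypothetical samples x1 = 2 and x2 = 6, plus candidate estimates
# spread between them.
samples = [2.0, 6.0]
candidates = [2.0, 3.0, 4.0, 5.0, 6.0]

for c in candidates:
    print(c, mean_abs_error(samples, c), mean_sq_error(samples, c))
```

Every candidate between the two samples gets the same average absolute error, while the average squared error is smallest only at the midpoint, 4.0.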
