by shawnz 3499 days ago
I am no math expert, but I have always thought about it like this. The squared error is like weighting the error by the error. This causes one big error to be more significant than many small errors, which is usually what you want. Am I on the right track?
3 comments

> This causes one big error to be more significant than many small errors,

That's correct.
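
A quick way to check the quoted claim, as a sketch with made-up residuals (the numbers and the helper names `sum_squared` / `sum_absolute` are only for illustration): compare one error of size 10 against ten errors of size 1, which have the same total absolute size.

```python
# Hypothetical residuals: one large error vs. ten small errors
# with the same total absolute size.
one_big = [10.0]
many_small = [1.0] * 10


def sum_squared(errors):
    """Sum of squared errors."""
    return sum(e * e for e in errors)


def sum_absolute(errors):
    """Sum of absolute errors."""
    return sum(abs(e) for e in errors)


print(sum_squared(one_big), sum_squared(many_small))    # 100.0 10.0
print(sum_absolute(one_big), sum_absolute(many_small))  # 10.0 10.0
```

Under squared error the single big miss costs ten times as much as the ten small ones combined; under absolute error they cost the same.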

> which is usually what you want

Unless you have outliers, in which case it's what you don't want. So you use e.g. a Huber loss function, which is quadratic for small errors and linear for large ones, to reach a compromise.
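
For reference, a minimal sketch of the Huber loss in its usual textbook form. The function name `huber` and the default threshold `delta=1.0` are choices made here for illustration, not something specified in this thread.

```python
def huber(residual, delta=1.0):
    """Huber loss: quadratic within +/-delta, linear outside it."""
    a = abs(residual)
    if a <= delta:
        # Small errors are treated like (half) squared error.
        return 0.5 * residual * residual
    # Large errors grow only linearly, so outliers pull less.
    return delta * (a - 0.5 * delta)


for r in (0.5, 3.0, -3.0):
    print(r, huber(r), 0.5 * r * r)  # residual, Huber loss, half squared error
```

The two branches meet at `abs(residual) == delta`, so the loss stays continuous; `delta` sets where "small" ends and "outlier" begins.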

I just thought it was to give positive and negative error values the same treatment. Moreover, I think it's debatable that one big error is more important than many small errors. That is conceivably a bad strategy in some cases -- if most points have low error, do you really want to penalize your candidate function for having a few bad outliers? To me that is no better than giving extra favor to a few points that happen to have low error.

No, that's exactly why absolute error is better. "Big errors" are called outliers. They're (relatively) rare, often caused by bad data (measurement errors, typos, etc.), and they substantially influence the outcome of your calculation. In other words, squared error is less robust.
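
One way to see the robustness point, as a sketch with invented data: the value that minimizes total squared error over a sample is its mean, and the value that minimizes total absolute error is its median. A single typo moves the first a lot and the second not at all here.

```python
import statistics

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_typo = [1.0, 2.0, 3.0, 4.0, 500.0]  # hypothetical typo: 5 entered as 500

# Mean = minimizer of squared error; median = minimizer of absolute error.
mean_shift = statistics.mean(with_typo) - statistics.mean(clean)
median_shift = statistics.median(with_typo) - statistics.median(clean)

print(mean_shift, median_shift)  # 99.0 0.0
```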

But squared error is easier to compute. So, in practice, what you do is you remove outliers (e.g. cap the data at +-3sigma) then use squared error.
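
A rough sketch of what capping at +-3 sigma could look like. The helper name `cap_at_k_sigma` and the sample values are made up for illustration. Note that this clips extreme values to the boundary rather than deleting them, and that in a small sample a single outlier inflates sigma enough that nothing may exceed 3 sigma at all.

```python
import statistics


def cap_at_k_sigma(values, k=3.0):
    """Clip each value into [mean - k*sigma, mean + k*sigma]."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation
    lo, hi = mu - k * sigma, mu + k * sigma
    return [min(max(v, lo), hi) for v in values]


# Hypothetical sample: 21 ordinary readings plus one wild one.
values = [9.0, 10.0, 11.0] * 7 + [100.0]
capped = cap_at_k_sigma(values)

print(max(values), max(capped))  # the wild reading is pulled in to mean + 3*sigma
```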

> So, in practice, what you do is you remove outliers (e.g. cap the data at +-3sigma) then use squared error.

But if you are, say, fitting a function to the data, you can't tell beforehand which data points are the outliers. So in that case perhaps you need an iterative approach to removing them (?)
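
One possible shape for such an iterative approach, as a sketch only: fit a line with ordinary least squares, drop points whose residual exceeds k times the RMS residual, and refit until nothing more is dropped. The names `fit_line` and `fit_with_trimming`, the threshold `k=3.0`, the RMS rule, and the test data are all assumptions made here, not something proposed in the thread.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_with_trimming(xs, ys, k=3.0, max_rounds=10):
    """Refit repeatedly, dropping points whose residual exceeds k * RMS."""
    keep = list(range(len(xs)))
    for _ in range(max_rounds):
        slope, intercept = fit_line([xs[i] for i in keep], [ys[i] for i in keep])
        residuals = [ys[i] - (slope * xs[i] + intercept) for i in keep]
        rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
        trimmed = [i for i, r in zip(keep, residuals) if abs(r) <= k * rms]
        if len(trimmed) == len(keep):
            break  # nothing dropped this round: the fit is stable
        keep = trimmed
    # If max_rounds runs out, the returned fit predates the last trim.
    return slope, intercept, keep


# Hypothetical data: y = 2x + 1 with small alternating noise and one bad point.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
ys[10] += 50.0  # the outlier, not known to the fitting code in advance

slope, intercept, keep = fit_with_trimming(xs, ys)
print(slope, intercept, len(keep))
```

With this data the first fit flags index 10, and the second fit on the remaining 19 points drops nothing further. Using a loss like Huber instead avoids the hard keep/drop decision.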