by tvural 3495 days ago
The best explanation is probably that squared error gives you the best fit (in the maximum likelihood sense) when you assume your errors are normally distributed.
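
For anyone who wants that connection spelled out, here is a brief sketch of the derivation. It assumes a model y_i = f(x_i; θ) + ε_i with independent errors ε_i ~ N(0, σ²) and a fixed σ; the symbols f, θ and σ are notation introduced here for illustration, not something from the comment above.

```latex
\begin{align*}
\log L(\theta)
  &= \sum_{i=1}^{n} \log\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left( -\frac{\bigl(y_i - f(x_i;\theta)\bigr)^2}{2\sigma^2} \right) \right] \\
  &= -\frac{n}{2}\log\bigl(2\pi\sigma^2\bigr)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f(x_i;\theta)\bigr)^2
\end{align*}
```

The first term does not depend on θ, so under these assumptions maximizing the likelihood over θ is the same as minimizing the sum of squared errors.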

Things like the fact that squared error is differentiable are actually irrelevant - if the best model is not differentiable, you should still use it.

3 comments

"if the best model is not differentiable, you should still use it."

I'm not sure I would say that - neural nets are "near everywhere differentiable", for example. Without differentiability we're stuck with, for example, discrete GAs for optimization, and you can throw all your intuition out the window (not to mention training/learning efficiency).

A few misconceptions I should correct in this comment.

- There is plenty of existing technology for handling non-differentiable functions. Functions like the absolute value, 2-norm, and so on have a generalization of the gradient (the subgradient) which can be used in lieu of the gradient.

- The idea that being "almost everywhere differentiable" (i.e. the non-differentiability lies in a manifold of zero measure) makes these functions behave pretty much like smooth ones. This is often not the case, as optima often conspire to lie exactly on these nonsmooth manifolds.
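
Both points can be seen in a tiny example. The sketch below is only an illustration under assumptions of my choosing: a one-dimensional lasso-style objective 0.5*(x - z)^2 + lam*|x|, a 1/k step size, and hypothetical function names. It runs a plain subgradient method and compares the result with the known closed-form minimizer (soft thresholding).

```python
def objective(x, z, lam):
    """One-dimensional lasso-style objective: smooth part plus an L1 penalty."""
    return 0.5 * (x - z) ** 2 + lam * abs(x)


def subgradient(x, z, lam):
    """Return one valid subgradient; at the kink x == 0 we pick 0 for |x|."""
    sign = (x > 0) - (x < 0)
    return (x - z) + lam * sign


def subgradient_descent(z, lam, x0=1.0, steps=5000):
    """Subgradient method with diminishing step 1/k, keeping the best iterate."""
    x = x0
    best_x = x0
    for k in range(1, steps + 1):
        x -= subgradient(x, z, lam) / k
        # The objective need not decrease every step, so track the best point.
        if objective(x, z, lam) < objective(best_x, z, lam):
            best_x = x
    return best_x


def soft_threshold(z, lam):
    """Closed-form minimizer of the objective above."""
    sign = (z > 0) - (z < 0)
    return sign * max(abs(z) - lam, 0.0)


if __name__ == "__main__":
    for z in (0.3, 2.0):
        print(z, subgradient_descent(z, 1.0), soft_threshold(z, 1.0))
```

With z = 0.3 and lam = 1.0 the exact minimizer is x = 0, which is precisely the point where the objective is not differentiable. The subgradient iterates bounce around the kink and only approach it slowly, whereas the minimizer for z = 2.0 sits in a smooth region.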

And error measures involving a sum of absolute values (i.e., the L1 norm) are central to methods like lasso (https://en.wikipedia.org/wiki/Lasso_(statistics)) and their cousins.

Yes, that was what I was saying. Absolute value and the 2-norm are fine thanks to subgradient techniques and theory, as well as their differentiability over the majority of their domain - but you can imagine tons of non-differentiable models where the subgradient is mostly useless and we generally use convex relaxations or other smoother analogs.

I don't think there was any misconception.

The fact that squared error is differentiable is not irrelevant. You can solve some machine learning models faster with differentiable objectives (most notably xgboost). Speed is important: you need to optimize your models, and the longer it takes to run a model, the fewer things you can try.

Regardless of how the errors are distributed, the squared error fit will provide the expectation value of the variable, which is the mean. It will say nothing of the error of the mean it calculates.
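
A quick way to check this for the simplest case, fitting a single constant, is a brute-force search. This is only a sketch: the data values and the candidate grid are made up for illustration, and `best_constant` is a hypothetical helper.

```python
import statistics


def best_constant(data, loss, candidates):
    """Return the candidate c that minimizes the total loss of residuals y - c."""
    return min(candidates, key=lambda c: sum(loss(y - c) for y in data))


# Deliberately skewed, clearly non-normal sample (made-up values).
data = [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 50.0]

# Candidate constants 0.00, 0.01, ..., 50.00.
candidates = [i / 100 for i in range(0, 5001)]

sq_fit = best_constant(data, lambda r: r * r, candidates)  # squared error
abs_fit = best_constant(data, abs, candidates)             # absolute error

if __name__ == "__main__":
    print("squared error fit:", sq_fit, "mean:", statistics.mean(data))
    print("absolute error fit:", abs_fit, "median:", statistics.median(data))
```

Even with the heavy outlier, the squared error fit lands on the sample mean, while the absolute error fit lands on the median. Neither number says anything about how uncertain that estimate is.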