| HN Mirror



	by tvural 3495 days ago
	The best explanation is probably that squared error gives you the best fit when you assume your errors are normally distributed. Things like the fact that squared error is differentiable are actually irrelevant - if the best model is not differentiable, you should still use it.
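
A minimal sketch of the normal-errors argument, assuming i.i.d. Gaussian noise with a known standard deviation: the negative log-likelihood is then a constant plus the sum of squared errors divided by 2*sigma^2, so both are minimized by the same parameters. The data, the one-parameter model `y = s * x`, and the helper names `sse` and `gaussian_nll` are made up for illustration.

```python
import numpy as np

def sse(y, pred):
    # Sum of squared errors between observations and predictions.
    return float(np.sum((y - pred) ** 2))

def gaussian_nll(y, pred, sigma):
    # Negative log-likelihood of y under independent N(pred, sigma^2) noise.
    n = y.size
    return float(0.5 * n * np.log(2 * np.pi * sigma**2) + sse(y, pred) / (2 * sigma**2))

# Synthetic data (an assumption for this sketch): true slope 2, noise sigma 0.1.
sigma = 0.1
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(scale=sigma, size=x.size)

# Scan candidate slopes and score each one both ways.
slopes = np.linspace(0.0, 4.0, 401)
sse_vals = np.array([sse(y, s * x) for s in slopes])
nll_vals = np.array([gaussian_nll(y, s * x, sigma) for s in slopes])

best_by_sse = slopes[np.argmin(sse_vals)]
best_by_nll = slopes[np.argmin(nll_vals)]
print(best_by_sse, best_by_nll)  # the two criteria pick the same slope
```

The two curves differ only by an additive constant and a positive scale factor, which is why the argmin coincides.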

3 comments

highd 3495 days ago

"if the best model is not differentiable, you should still use it."

I'm not sure I would say that - neural nets are "near everywhere differentiable", for example. Without differentiability we're stuck with, for example, discrete GAs for optimization, and you can throw all your intuition out the window (not to mention training/learning efficiency).

gabrielgoh 3495 days ago

A few misconceptions I should correct in this comment.

- There is plenty of existing technology for handling non-differentiable functions. Functions like the absolute value, the 2-norm, and so on have a generalization of the gradient (the subgradient) which can be used in lieu of the gradient.

- The idea that functions being "almost everywhere differentiable" (i.e. the non-differentiability lies on a manifold of zero measure) makes them behave pretty much like smooth ones. This is often not the case, as optima often conspire to lie exactly on these nonsmooth manifolds.
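
Both points can be seen in a small sketch, assuming a toy objective f(x) = |x - 1| + |x - 2| + |x - 7| and a plain subgradient method with step size 1/k (the data points and the helper name `subgradient` are hypothetical). The minimizer is the median, 2, which sits exactly on one of the kinks where f is not differentiable.

```python
import numpy as np

def subgradient(x, points):
    # One valid subgradient of f(x) = sum_i |x - p_i|.
    # np.sign(0) is 0, which lies in the subdifferential [-1, 1] of |.| at 0.
    return float(np.sum(np.sign(x - points)))

points = np.array([1.0, 2.0, 7.0])  # toy data; the median is 2.0

x = 0.0
for k in range(1, 2001):
    # Diminishing step size 1/k, a standard choice for subgradient methods.
    x -= subgradient(x, points) / k

print(x)  # close to 2.0, the kink at the median
```

The iterates oscillate around the kink with shrinking amplitude rather than settling smoothly, which is one way nonsmooth optima behave differently from smooth ones.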

kkylin 3495 days ago

And error measures involving sum of absolute values (i.e., L1 norm) are central to methods like lasso (https://en.wikipedia.org/wiki/Lasso_(statistics)) and their cousins.
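
As a rough illustration of why the L1 penalty matters, consider the one-dimensional lasso problem: minimize 0.5 * (w - z)^2 + lam * |w|. Its closed-form solution is the soft-thresholding operator, which returns exactly zero whenever |z| <= lam, so the optimum lands on the non-differentiable point of the penalty. The function name `soft_threshold` and the numbers below are chosen only for this sketch.

```python
import numpy as np

def soft_threshold(z, lam):
    # Minimizer of 0.5 * (w - z)**2 + lam * |w| over w.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

lam = 0.5
small = soft_threshold(0.3, lam)   # |z| <= lam: shrunk to exactly zero
large = soft_threshold(2.0, lam)   # |z| > lam: shrunk toward zero by lam

# Brute-force check on a grid for z = 0.3.
w = np.linspace(-3.0, 3.0, 6001)
objective = 0.5 * (w - 0.3) ** 2 + lam * np.abs(w)
grid_min = w[np.argmin(objective)]

print(small, large, grid_min)
```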

highd 3495 days ago

Yes, that was what I was saying. Absolute value, 2-norm are fine thanks to subgradient techniques and theory, as well as their differentiability over the majority of the function - but you can imagine tons of non-differentiable models where the subgradient is mostly useless and we generally use convex relaxations or other smoother analogs.

I don't think there was any misconception.

throw_away_777 3495 days ago

The fact that squared error is differentiable is not irrelevant. You can solve some machine learning models faster with differentiable objectives (most notably xgboost). Speed is important: you need to optimize your models, and the longer it takes to run a model, the fewer things you can try.
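
To make the xgboost point concrete: second-order gradient boosting works from the first and second derivatives of the loss with respect to the prediction. The sketch below is not xgboost's API. The function names are hypothetical, and the leaf-weight formula w = -sum(grad) / (sum(hess) + lambda) is the Newton step described in the XGBoost paper, applied here to a single leaf holding all the points.

```python
import numpy as np

def squared_error_grad_hess(y_true, y_pred):
    # Loss per point: 0.5 * (y_pred - y_true)**2.
    grad = y_pred - y_true        # first derivative w.r.t. y_pred
    hess = np.ones_like(y_true)   # second derivative is constant
    return grad, hess

def newton_leaf_weight(grad, hess, reg_lambda=0.0):
    # Newton step for one leaf: -G / (H + lambda).
    return -grad.sum() / (hess.sum() + reg_lambda)

y = np.array([1.0, 2.0, 6.0])
pred = np.zeros_like(y)           # start from a prediction of zero

g, h = squared_error_grad_hess(y, pred)
w = newton_leaf_weight(g, h)
print(w)  # 3.0, the mean of y
```

With squared error the Hessian is constant, so one Newton step lands on the exact minimizer for the leaf. A loss without usable derivatives gives no such shortcut.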

eanzenberg 3495 days ago

Regardless of how the errors are distributed, the squared error fit will provide the expectation value of the variable, which is the mean. It will say nothing about the error of the mean it calculates.
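
A quick numerical check of the first claim, using a deliberately skewed made-up sample and a brute-force search over constant predictions: the squared-error minimizer is the sample mean, while the absolute-error minimizer is the median.

```python
import numpy as np

# Skewed toy sample (hypothetical values), clearly not normally distributed.
y = np.array([1.0, 2.0, 3.0, 10.0, 100.0])

# Candidate constant predictions on a grid with spacing 0.01.
candidates = np.linspace(0.0, 100.0, 10001)
residuals = y[None, :] - candidates[:, None]

sq_loss = (residuals ** 2).sum(axis=1)
abs_loss = np.abs(residuals).sum(axis=1)

best_sq = candidates[np.argmin(sq_loss)]    # lands on the mean, 23.2
best_abs = candidates[np.argmin(abs_loss)]  # lands on the median, 3.0
print(best_sq, best_abs)
```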
