by senekerim 3857 days ago
It's not really a cheat, as it produces the maximum likelihood estimate (MLE) under certain assumptions (e.g. the errors are normally distributed, which occurs naturally if each error is the sum of many unrelated measurement errors).

Oh, I'm aware. It's just hard to always justify it and especially hard to do so in historical context.
I don't think that this is quite the right explanation.

|y - wx| isn't differentiable everywhere, but it is subdifferentiable. The derivative is defined at every point except where y = wx, and there you're more or less fine if you just pretend that the derivative is zero, since zero is a valid subgradient at that point.

One reason why squaring is preferred is that the derivative of the squared error is linear in the parameters, so setting it to zero gives a closed-form solution.
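
To make the closed-form point concrete, here is a minimal sketch for the one-parameter model y ≈ wx used above. The function name `fit_least_squares` and the data are made up for illustration; this is just the textbook formula, not anything specific from the thread.

```python
# Closed-form least squares for the one-parameter model y ≈ w * x.
# Setting d/dw sum((y_i - w*x_i)^2) = -2 * sum(x_i * (y_i - w*x_i)) to zero
# gives w = sum(x_i * y_i) / sum(x_i^2).

def fit_least_squares(xs, ys):
    """Return the w minimizing sum((y - w*x)^2). Hypothetical helper name."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx

# Made-up illustrative data, roughly y = 2x plus noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w = fit_least_squares(xs, ys)

# The derivative of the squared error vanishes at the fitted w.
grad_at_w = -2.0 * sum(x * (y - w * x) for x, y in zip(xs, ys))
```

No such one-line formula exists for the sum of |y - wx|; in this one-parameter case the minimizer is a weighted median, which you find by sorting or searching rather than by solving a linear equation.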

Another is that people like to fit their models with gradient descent, and gradient descent has better guarantees for strongly convex loss functions. It also works better in practice. Intuitively: if you try to minimize x^2 by gradient descent, you have the largest gradients when you're far from the minimum, and they shrink as you approach it. If you try to minimize |x|, then your gradient is always either +1 or -1, so with a fixed step size you keep overshooting and it is harder to converge around the minimum. See the Huber loss for a smooth relaxation of the absolute value function: it is quadratic near zero and linear in the tails (convex, though not strongly convex).
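
A rough sketch of that intuition, assuming plain fixed-step gradient descent in one dimension. The step size, starting point, Huber threshold `delta`, and all function names are arbitrary choices for illustration.

```python
def grad_square(x):
    # Derivative of x^2.
    return 2.0 * x

def subgrad_abs(x):
    # Subgradient of |x|: sign(x), choosing 0 at the kink x == 0.
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0

def grad_huber(x, delta=1.0):
    # Derivative of the Huber loss: x inside [-delta, delta],
    # delta * sign(x) outside. delta=1.0 is an arbitrary choice.
    if abs(x) <= delta:
        return x
    return delta if x > 0 else -delta

def descend(grad, x0, step, n_steps):
    # Fixed-step gradient descent in one dimension.
    x = x0
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x

x_sq = descend(grad_square, x0=1.05, step=0.1, n_steps=100)
x_abs = descend(subgrad_abs, x0=1.05, step=0.1, n_steps=100)
x_hub = descend(grad_huber, x0=1.05, step=0.1, n_steps=100)
```

With these settings the x^2 and Huber iterates shrink geometrically toward 0, while the |x| iterate ends up bouncing between roughly +0.05 and -0.05, because every step has the same size 0.1 no matter how close it is to the minimum. A decaying step size fixes that, at the cost of slower convergence.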