
I don't think that this is quite the right explanation.

|y - wx| isn't differentiable everywhere, but it is subdifferentiable. The derivative is defined at every point except where y = wx, and there you're more or less fine if you just pretend the derivative is zero, since zero is a valid subgradient at that point.
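To spell that out, here is the subdifferential with respect to w, treating x and y as fixed scalars (the name f is just my notation, not something from the parent comment):

```latex
f(w) = |y - wx|, \qquad
\partial f(w) =
\begin{cases}
\{\, -x \,\operatorname{sign}(y - wx) \,\} & \text{if } y \neq wx, \\
[\, -|x|,\ |x| \,] & \text{if } y = wx.
\end{cases}
```

Zero lies in the interval [-|x|, |x|], which is why picking zero at the kink is a legitimate choice.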

One reason why squaring is preferred is that the derivative of the squared error is linear in w, so you can set it to zero and solve for w in closed form.
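As a rough sketch of what the closed form looks like in the simplest case (one weight, no intercept): setting the derivative of sum((y - w*x)^2) to zero gives -2 * sum(x * (y - w*x)) = 0, so w = sum(x*y) / sum(x*x). The function name fit_slope and the toy data are made up for illustration.

```python
def fit_slope(xs, ys):
    """Closed-form least-squares slope for the model y ~ w * x (no intercept)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        # Every x is zero, so w has no effect on the loss and is not identifiable.
        raise ValueError("all x values are zero; slope is undefined")
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = fit_slope(xs, ys)
print(w)
```

There's no comparably simple formula for the absolute-error version; in this one-weight case the minimizer is a weighted median, which you find by sorting rather than by solving a linear equation.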

Another is that people like to fit their models with gradient descent, and gradient descent has better guarantees for strongly convex loss functions. It also works better in practice. Intuitively: if you try to minimize x^2 by gradient descent, the gradient is largest when you're far from the minimum and shrinks as you approach it. If you try to minimize |x|, then your gradient is always either +1 or -1 (away from zero), so with a fixed step size the iterates keep overshooting and bounce around the minimum instead of settling. See the Huber loss for a smooth compromise: quadratic near zero and linear in the tails. (It isn't strongly convex globally, only in the quadratic region.)
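Here's a small sketch of that intuition using plain fixed-step gradient descent in one dimension. The starting point, step size, and step count are arbitrary choices of mine, and the Huber gradient uses the usual definition with threshold delta (x inside the threshold, delta * sign(x) outside).

```python
def sign(x):
    """Return -1, 0, or +1; using 0 at x == 0 is a valid subgradient of |x|."""
    return (x > 0) - (x < 0)


def grad_square(x):
    return 2.0 * x  # derivative of x^2


def grad_abs(x):
    return float(sign(x))  # (sub)gradient of |x|


def grad_huber(x, delta=1.0):
    # Huber loss is 0.5 * x^2 for |x| <= delta, and linear beyond that.
    return x if abs(x) <= delta else delta * sign(x)


def gradient_descent(grad, x0, lr, steps):
    """Fixed-step gradient descent; returns the final iterate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


x0, lr, steps = 1.05, 0.1, 100
x_sq = gradient_descent(grad_square, x0, lr, steps)
x_abs = gradient_descent(grad_abs, x0, lr, steps)
x_hub = gradient_descent(grad_huber, x0, lr, steps)

print("x^2:  ", abs(x_sq))
print("|x|:  ", abs(x_abs))
print("Huber:", abs(x_hub))
```

With these settings the x^2 and Huber runs shrink toward zero geometrically, while the |x| run ends up hopping back and forth across zero at a distance on the order of the step size. In practice you'd fix the |x| case with a decaying step size, but that's exactly the extra care the smooth losses don't need.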