Hacker News
by orlp 408 days ago
This isn't true. In practice people don't use the analytical solution for efficient linear regression, they use stochastic methods.

Square error is used because it is the maximum likelihood estimator under the assumption that observation noise is normally distributed, not because it is analytical.
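To make the maximum likelihood point concrete, here is a minimal sketch in plain Python. The synthetic data, the noise level and the helper names (`sse`, `gaussian_nll`, `ols_fit`) are all assumptions made up for illustration; nothing here comes from a particular library. For a fixed noise scale sigma, the Gaussian negative log-likelihood is a constant plus SSE / (2 * sigma^2), so whatever minimizes the squared error also maximizes the likelihood.

```python
import math
import random

random.seed(0)

# Synthetic data (assumed for illustration): y = 1 + 2x + Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.5) for x in xs]


def sse(a, b):
    """Sum of squared errors of the line y = a + b*x on the data above."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def gaussian_nll(a, b, sigma):
    """Negative log-likelihood assuming i.i.d. N(0, sigma^2) noise."""
    n = len(xs)
    const = 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
    return const + sse(a, b) / (2.0 * sigma ** 2)


def ols_fit():
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


a_hat, b_hat = ols_fit()

# Moving away from the least-squares fit raises the Gaussian NLL,
# whichever fixed sigma is plugged in.
for sigma in (0.1, 0.5, 2.0):
    assert gaussian_nll(a_hat, b_hat, sigma) < gaussian_nll(a_hat + 0.1, b_hat, sigma)
    assert gaussian_nll(a_hat, b_hat, sigma) < gaussian_nll(a_hat, b_hat - 0.1, sigma)
```

Under a different noise assumption the equivalence changes. Laplace noise, for example, would lead to absolute error instead.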

3 comments

AFAIK using the analytic solution for linear regression (via lm in R, statsmodels in python or any other classical statistical package) is still the norm in traditional disciplines such as social (economics, psychology, sociology) and physical (bio/chemistry) sciences.

I think that as a field, Machine Learning is the exception rather than the norm, where people start off with or quickly move to non-linear models, huge datasets and (stochastic) gradient-based solvers.

Gaussianity of errors is more of a post-hoc justification (which is often not even tested) for fitting with OLS.
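For anyone who wants to see the two approaches mentioned above side by side, here is a rough sketch in plain Python: the closed-form fit for a single predictor next to a bare-bones SGD loop on the same squared-error loss. The data, learning rate, epoch count and function names (`closed_form`, `sgd`) are assumptions picked for illustration, not what lm or statsmodels do internally.

```python
import random

random.seed(1)

# Synthetic data (assumed for illustration): y = 1 + 2x + small Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [1.0 + 2.0 * x + random.gauss(0.0, 0.1) for x in xs]


def closed_form(xs, ys):
    """Analytic least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


def sgd(xs, ys, lr=0.05, epochs=200):
    """Plain SGD on 0.5 * (y - a - b*x)^2, one sample per update."""
    a, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        random.shuffle(idx)  # visit the samples in a fresh random order
        for i in idx:
            err = ys[i] - (a + b * xs[i])
            a += lr * err          # step along the negative gradient w.r.t. a
            b += lr * err * xs[i]  # step along the negative gradient w.r.t. b
    return a, b


a_cf, b_cf = closed_form(xs, ys)
a_sgd, b_sgd = sgd(xs, ys)
print("closed form:", a_cf, b_cf)
print("SGD:        ", a_sgd, b_sgd)
```

With a constant step size the SGD estimate keeps jittering around the least-squares solution instead of landing on it exactly. On a small, well-conditioned problem like this one the closed form is both cheaper and exact.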

If by stochastic methods you mean something like MCMC, they are increasing in popularity, but are still used a lot less than analytical or numerical methods, and almost exclusively for models more complicated than basic linear regression. Sampling methods have major downsides, and approximation methods like ADVI are becoming more popular. Though sampling vs. approximation is a bit off topic, as neither is usually applied to problems that have closed-form solutions.

Even the most popular of the more complicated models, like multilevel (linear) regression, make use of the mathematical convenience of the square error, even though the solutions aren't fully analytical.

Square error indeed gives the maximum likelihood estimates for normally distributed noise, but as I said, this assumption is quite often implicit, and not even really well understood by many practitioners.

Analytical solutions for squared errors have a long history in more or less all fields using regression and related models, and there's a lot of inertia behind them. E.g. ANOVA is still the default method (although it is being replaced by multilevel regression) in many fields. This history is mainly due to analytical convenience, as the solutions were originally computed on paper. That doesn't mean the normality assumption is not often justifiable. And when it is not directly justifiable, the traditional solution is to transform the variables to get (approximately) normally distributed ones for analytical solutions.

It’s not because of analytical convenience, it’s because of the central limit theorem.
Not everything is a linear combination of a large number of (IID) samples, and thus not everything is Gaussian distributed.
You’re implying that many things are though.
Yes, and I was explicit about it in another comment in this post.
Ok, so we all agree that in most cases the reason to minimize square error is that it’s the appropriate thing to minimize - not that it has an analytical solution, convenience or tradition.
...because stochastic methods are implicit regularizers, leading to solutions that generalize better. Let's spell it out for those that don't know.

https://www.inference.vc/notes-on-the-origin-of-implicit-reg...

OLS is a convex optimization problem, so this doesn't really apply. And for statistical analysis you really don't want to add poorly understood artificial noise to the parameter estimates anyway.
In general you do, because the unbiased estimates have higher generalization error. You are already dealing with sampling noise. I am not an expert in optimization, and I don't know what "poorly understood" means to you, but I know there is quite some research on the properties of SGD noise; e.g., https://francisbach.com/rethinking-sgd-noise/

Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning https://arxiv.org/abs/2301.13703