| HN Mirror



	by orlp 408 days ago
	This isn't true. In practice people don't use the analytical solution for efficient linear regression; they use stochastic methods. Squared error is used because minimizing it gives the maximum likelihood estimator under the assumption that observation noise is normally distributed, not because it has an analytical solution.
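For readers who want the maximum likelihood claim spelled out, here is a short sketch. The notation ($x_i$, $y_i$, $\beta$, $\sigma^2$, $n$) is introduced here for illustration and does not come from the comment; it assumes a linear model with independent Gaussian noise of constant variance.

```latex
y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2)

\log L(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2

\hat\beta_{\mathrm{ML}} = \arg\max_{\beta} \log L(\beta, \sigma^2) = \arg\min_{\beta} \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2
```

The first term of the log-likelihood does not depend on $\beta$, so maximizing over $\beta$ is the same as minimizing the sum of squared residuals.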

3 comments

em500 408 days ago

AFAIK using the analytic solution for linear regression (via lm in R, statsmodels in python or any other classical statistical package) is still the norm in traditional disciplines such as social (economics, psychology, sociology) and physical (bio/chemistry) sciences.
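As a rough illustration of what "the analytic solution" means here, the sketch below solves the same least-squares problem in NumPy, once with `np.linalg.lstsq` and once with the normal equations. The data are synthetic and the names (`fit_ols`, `true_beta`) are made up for this example; it is not a description of how `lm` or statsmodels work internally.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on X with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Synthetic data (an assumption for the demo): intercept 1, slopes 2 and -3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([1.0, 2.0, -3.0])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=200)

beta_hat = fit_ols(X, y)

# Same estimate from the normal equations (X'X) beta = X'y.
X1 = np.column_stack([np.ones(len(X)), X])
beta_normal_eq = np.linalg.solve(X1.T @ X1, X1.T @ y)

print(beta_hat)
print(beta_normal_eq)
```

Both routes are direct linear algebra with no iteration or step size to tune, which is part of why this remains the default in classical statistics packages.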

I think that as a field, Machine Learning is the exception rather than the norm: there, people start off with or proceed rapidly to non-linear models, huge datasets and (stochastic) gradient-based solvers.

Gaussianity of errors is more of a post-hoc justification (which is often not even tested) for fitting with OLS.

jampekka 408 days ago

If by stochastic methods you mean something like MCMC, they are increasing in popularity, but are still used a lot less than analytical or numerical methods, and almost exclusively for models more complicated than basic linear regression. Sampling methods have major downsides, and approximation methods like ADVI are becoming more popular. Though sampling vs. approximation is a bit off topic, as neither usually has a closed-form solution.

Even the most popular of the more complicated models, like multilevel (linear) regression, make use of the mathematical convenience of the squared error, even though the solutions aren't fully analytical.

Squared error indeed gives maximum likelihood estimates under normally distributed noise, but as I said, this assumption is quite often implicit, and not even really well understood by many practitioners.

Analytical solutions for squared errors have a long history for more or less all fields using regression and related models, and there's a lot of inertia for them. E.g. ANOVA is still the default method (although being replaced by multilevel regression) for many fields. This history is mainly due to the analytical convenience as they were computed on paper. That doesn't mean the normality assumption is not often justifiable. And when not directly, the traditional solution is to transform the variables to get (approximately) normally distributed ones for analytical solutions.

xadhominemx 408 days ago

It’s not because of analytical convenience, it’s because of the central limit theorem.

jampekka 408 days ago

Not everything is a linear combination of a large number of (IID) samples, and thus not everything is Gaussian distributed.
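A small simulation can make this point concrete. The sketch below compares noise built by averaging many IID draws (where the central limit theorem applies) with a single skewed draw (where it does not), using excess kurtosis as a crude Gaussianity check. The distributions, sample sizes and the `excess_kurtosis` helper are choices made for this example only.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis; close to 0 for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**4) - 3.0)

rng = np.random.default_rng(42)
n = 50_000

# Average of 50 IID uniform draws: the CLT applies, so this is nearly Gaussian.
clt_noise = rng.uniform(-1.0, 1.0, size=(n, 50)).mean(axis=1)

# A single exponential draw: skewed and heavy-tailed, no averaging involved.
skewed_noise = rng.exponential(scale=1.0, size=n)

k_clt = excess_kurtosis(clt_noise)
k_skewed = excess_kurtosis(skewed_noise)
print(f"averaged uniforms: {k_clt:.2f}, exponential: {k_skewed:.2f}")
```

The averaged noise should come out near zero excess kurtosis, while the exponential noise should be far from it, so whether a Gaussian noise model is reasonable depends on how the noise actually arises.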

kgwgk 408 days ago

You’re implying that many things are though.

jampekka 408 days ago

Yes, and I was explicit about it in another comment in this post.

kgwgk 408 days ago

Ok, so we all agree that in most cases the reason to minimize square error is that it’s the appropriate thing to minimize - not that it has an analytical solution, convenience or tradition.

esafak 408 days ago

...because stochastic methods are implicit regularizers, leading to solutions that generalize better. Let's spell it out for those that don't know.

https://www.inference.vc/notes-on-the-origin-of-implicit-reg...

jampekka 408 days ago

OLS is a convex optimization problem, so this doesn't really apply. And for statistical analysis you really don't want to add poorly understood artificial noise to the parameter estimates anyway.
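To illustrate the convexity point, here is a minimal sketch in which plain full-batch gradient descent on the mean squared error ends up at the same coefficients as the closed-form least-squares solution. The data, learning rate and iteration count are arbitrary choices for the demo, and it assumes a well-conditioned, full-rank design matrix so that the minimizer is unique.

```python
import numpy as np

# Synthetic, well-conditioned regression problem (assumed for the demo).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=100)

# Closed-form least-squares solution.
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Full-batch gradient descent on the mean squared error, starting from zero.
beta_gd = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = 2.0 / len(y) * X.T @ (X @ beta_gd - y)
    beta_gd -= lr * grad

print(beta_closed)
print(beta_gd)
```

Because the objective has a single global minimum in this setting, the iterative optimizer has no alternative solution to drift toward once it has converged.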

esafak 408 days ago

In general you do, because the unbiased estimates have higher generalization error. You are already dealing with sampling noise. I am not an expert in optimization, and I don't know what "poorly understood" means to you, but I know there is quite a bit of research on the properties of SGD noise; e.g., https://francisbach.com/rethinking-sgd-noise/

Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning https://arxiv.org/abs/2301.13703
