Hacker News new | ask | show | jobs
by jampekka 408 days ago
The main practical reason why square error is minimized in ordinary linear regression is that it has an analytical solution. Makes it a bit weird example for gradient descent.

There are plenty of error formulations that give a smooth loss function, and many even a convex one, but most don't have analytical solutions so they are solved via numerical optimization like GD.

The main message is IMHO correct though: square error (and its implicit gaussian noise assumption) is all too often used just per convenience and tradition.

4 comments

I’ve always felt that ML introductions completely butcher OLS. When I was taught it in stats we had to consider the Gauss-Markov conditions and interpret the coefficients, we would study the residuals. ML introductions just focus getting good predictions.
IMO that's the fundamental difference between statistics and ML. The culture of stats is about fitting a model and interpreting the fit, while the culture of ML is to treat the model as a black box.

That's one of the reasons that multicollinearity is seen as a big deal by statisticians, but ML practitioners couldn't give a hoot.

You are describing the difference between academic mathematician statisticians and "applied/engineering/actuarial/business" people who use statistics. The "black box" culture goes back to before ML and before both computing Machines M and statistical Learning (iterative models)
I suspect that the "black box" philosophy for statistics/ML is actually bad if you don't have a quick way of verifying the predictions. For instance, using PCA as a "black box" is perfectly fine if you're using it to de-noise readings from a camera or other instrument, because a human being can quickly tell if the de-noising is working correctly or not. But if you're using PCA to make novel discoveries, where you don't have an independent way of checking those discoveries, then it might be outright essential to have a deep definition-theorem-proof style understanding of PCA. What do people think of this hunch?

The point about PCA applies to population genetics and psychometrics (IQ). Some conclusions have been derived using PCA that appear to be supported by little else, and these have come under question.

You make a good point, though the difference between ML and statistics isn't just about interpreting and validating the model. It's about the "novel discoveries" part aka Doing Science.

Statistical modeling is done primarily in service of scientific discovery--for the purpose of making an inference (population estimate from a sample) or a comparison to test a hypothesis derived from a theoretical causal model of a real-world process before viewing data. The parameters of a model are interpreted because they represent an estimate of a treatment effect of some intervention.

Methods like PCA can be part of that modeling process either way, but analyzing and fitting models to data to mine it for patterns without an a priori hypothesis is not science.

Only perfect multicollinearity (correlation of 1.0 or -1.0) is a problem at the linear algebra level when fitting a statistical model.

But theoretically speaking, in a scientific context, why would you want to fit an explanatory model that includes multiple highly (but not perfectly) correlated independent variables?

It shouldn't be an accident. Usually it's because you've intentionally taken multiple proxy measurements of the same theoretical latent variable and you want to reduce measurement error. So that becomes a part of your measurement and modeling strategy.

I think this distinction is not sharp. You do hear ML practitioners talk about interpretability a lot.
This isn't true. In practice people don't use the analytical solution for efficient linear regression, they use stochastic methods.

Square error is used because it is the maximum likelihood estimator under the assumption that observation noise is normally distributed, not because it is analytical.

AFAIK using the analytic solution for linear regression (via lm in R, statsmodels in python or any other classical statistical package) is still the norm in traditional disciplines such as social (economics, psychology, sociology) and physical (bio/chemistry) sciences.

I think that as a field, Machine Learning is the exception rather than the norm, where people people start off or proceed rapidly to non-linear models, huge datasets and (stochastic) gradient based solvers.

Gaussianity of errors is more of a post-hoc justification (which is often not even tested) for fitting with OLS.

If by stochastic methods you mean something like MCMC, they are increasing in popularity, but still used a lot less than analytical or numerical methods. And almost exclusively only for more complicated models than basic linear regression. Sampling methods have major downsides, and approximation methods like ADVI are becoming more popular. Though sampling vs approximations is a bit off topic, as neither usually have closed form solutions.

Even the most popular more complicted models like multilevel (linear) regression make use of the mathematical convenience of the square error, even though the solutions aren't fully analytical.

Square error indeed gives estimates for normally distributed noise, but as I said, this assumption is quite often implicit, and not even really well understood by many practitioners.

Analytical solutions for squared errors have a long history for more or less all fields using regression and related models, and there's a lot of inertia for them. E.g. ANOVA is still the default method (although being replaced by multilevel regression) for many fields. This history is mainly due to the analytical convenience as they were computed on paper. That doesn't mean the normality assumption is not often justifiable. And when not directly, the traditional solution is to transform the variables to get (approximately) normally distributed ones for analytical solutions.

It’s not because of analytical convenience, it’s because of the central limit theorem.
Not everything is a linear combination of large number of (IID) samples, and thus not everything is gaussian distributed.
You’re implying that many things are though.
Yes, and I was explicit about it in another comment in this post.
...because stochastic methods are implicit regularizers, leading to solutions that generalize better. Let's spell it out for those that don't know.

https://www.inference.vc/notes-on-the-origin-of-implicit-reg...

OLS is a convex optimization problem, so this doesn't really apply. And for statistical analysis you really don't want to add poorly understood artificial noise to the parameter estimates anyway.
In general you do, because the unbiased estimates have higher generalization error. You are already dealing with sampling noise. I am not an expert in optimization, and what "poorly understood" means to you, but I know there is quite some research on the properties of SGD noise; e.g., https://francisbach.com/rethinking-sgd-noise/

Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning https://arxiv.org/abs/2301.13703

That is incorrect. Least squares follows directly from the central limit theorem.
Central limit theorem tells in practice that gaussian distributions is can be expected to be quite common. And it makes the gaussian distribution a good first guess. Least squares gives the ML estimate for gaussian residuals. I don't find this very direct, and there being a rationale doesn't mean that rationale is what in reality drives the usage.

I mention the relation to the gaussian distribution. Which part of the comment is incorrect?

This part is incorrect: “ The main practical reason why square error is minimized in ordinary linear regression is that it has an analytical solution”

OLS is popular because it gives correct answers as a result of the CLT

And it has an analytical solution, which was important before computing (and still makes it quicker today).
In other words, as economists say, because OLS is provably the BLUE (Best Linear Unbiased Estimator) aka the Gauss-Markov Theorem.
OLS is a straightforward way to introduce GD, and although an analytic solution exists it becomes memory and IO bound at sufficient scale, so GD is still a practical option.
Computationally OLS is taking the pseudoinverse of the system matrix, which for dense systems has a complexity of O(samples * parameters^2). For some GD implementations the complexity of a single step is probably O(samples * parameters), so there could be a asymptotic benefit, but it's hard to imagine a case where the benefit is even realized, let alone makes a practical difference.

And in any case nobody uses GD for regressions for statistical analysis purposes. In practice Newton-Raphson or other more complicated schemes (with a lot higher computation, memory and IO demands) with a lot nicer convergence properties are used.

Mini batch and streaming GD make the benefits obvious and trivial. Closed form OLS is unbeatable so long as samples * params^2 is comfortably sitting in memory. You often lose that as soon as your p approaches 10^5, which is common these days. Soon as you need distributed, streaming, or your data is too tall and or too wide then first order methods are the point of call.
With batching it becomes SGD. If you're OK with approximations, you have e.g. randomized, reduced rank and streaming SVDs. And these tend have a lot nicer approximation and convergence properties than SGD.

What are the common cases for 10^5 parameter OLS? Perhaps something like weather models could include such computations?