by gmfawcett 2994 days ago
> Assumptions of linear regression: There must be a linear relation between independent and dependent variables.

That's not wrong, but it's a strong way to word it. If linear regression were only suitable when the variables were perfectly linearly related, it would get a lot less use. Practically, linear regression can be used when the relationship is linear-ish, at least in the interval of interest. In other words, you can choose to declare linearity as an assumption (and take responsibility for what that choice entails, and for the error it might introduce into your analysis).

That the linear model is "correct" is only an assumption if you're trying to draw probabilistic inferences.

There's nothing stopping you from using it as a "best fit line", even when you have no reason to believe those assumptions. But then it's just a best-fit line. It tells you the direction and magnitude of linear trend, nothing more. That's never wrong in any sense, it's just that sometimes it's not very useful.

From the semiparametric perspective, you can still make correct inferences about the estimated parameters even if the model is not correctly specified, as long as you use the so-called robust estimator of the variance.

This is impossible. If the model is incorrectly specified (does not include all and only the relevant parameters and interactions), it doesn't matter much what games are played with the math. Changing the model will change the estimates...

Edit: For example, see here where making arbitrary choices of how to code categorical variables will change the estimates: https://news.ycombinator.com/item?id=16719754

If you change the model the meaning of all the coefficients changes.
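To make that concrete, here is a minimal sketch with made-up data (not necessarily the example from the linked comment). The same two-group response is fit twice, once with group A as the reference level and once with group B. The helper `fit_line` and the data are hypothetical, written only for this illustration.

```python
# Hypothetical data: a response measured in two groups, "A" and "B".
groups = ["A", "A", "A", "B", "B", "B"]
y = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Coding 1: A is the reference level (dummy = 1 for B).
dummy_b = [1.0 if g == "B" else 0.0 for g in groups]
int_1, slope_1 = fit_line(dummy_b, y)

# Coding 2: B is the reference level (dummy = 1 for A).
dummy_a = [1.0 if g == "A" else 0.0 for g in groups]
int_2, slope_2 = fit_line(dummy_a, y)

# Fitted values under each coding, in the original row order.
fitted_1 = [int_1 + slope_1 * d for d in dummy_b]
fitted_2 = [int_2 + slope_2 * d for d in dummy_a]

print("A as reference:", int_1, slope_1)
print("B as reference:", int_2, slope_2)
print("same fitted values:", fitted_1 == fitted_2)
```

With A as the reference, the intercept is the mean of A and the coefficient is (mean of B) minus (mean of A). With B as the reference, both numbers change, but the fitted values do not. The estimates differ because they answer different questions, not because one coding is more right than the other.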

Oops I did not see your response until now.

I agree, changing the model changes the estimates, because the parameters you are estimating change.

However, given one misspecified model, the parameters of that model are still well defined, though they may not have the interpretation they would if the model was correctly specified. As OP called it, this is the "best fit line", and is a projection of the truth onto your model. E.g. for a simple linear regression of Y on X, where the true conditional mean of Y given X is not linear, there is still some "true" best line. This line depends also on the distribution of X, though it would not if the model was correct. Estimates from linear regression will converge to the parameters of this line, though using the usual standard errors will be wrong.
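A rough simulation of the standard-error point, under assumptions I'm making up for illustration: X is standard normal, the true conditional mean is x^2, and the noise is Gaussian with standard deviation 0.5. For the robust estimator I'm using the HC0 (White) sandwich form, which is one common version. In this setup the population best-fit slope is 0, since Cov(X, X^2) = 0 for a symmetric X.

```python
import math
import random

random.seed(0)  # arbitrary seed, only for reproducibility

n = 20000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
# Assumed "truth" for this sketch: the conditional mean is x^2, not linear.
y = [xi ** 2 + random.gauss(0.0, 0.5) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# OLS fit of the (misspecified) straight line.
slope = sxy / sxx
intercept = y_bar - slope * x_bar
resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

# Usual standard error of the slope: assumes the line is the true
# conditional mean and the error variance is constant.
sigma2 = sum(e ** 2 for e in resid) / (n - 2)
classical_se = math.sqrt(sigma2 / sxx)

# HC0 sandwich standard error of the slope: does not rely on either
# of those assumptions.
meat = sum(((xi - x_bar) ** 2) * (e ** 2) for xi, e in zip(x, resid))
robust_se = math.sqrt(meat / sxx ** 2)

print("slope:", slope)
print("classical SE:", classical_se)
print("robust SE:", robust_se)
```

The slope estimate lands near 0, as expected, but the two standard errors disagree noticeably, with the usual one being too small here. How large the gap is depends entirely on the assumed setup.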

There's a very general theorem or corollary that covers this in Asymptotic Statistics by van der Vaart. I think in the chapter about M estimators, right around where MLEs are covered, but I don't have it in front of me.

There are multiple inference levels here.

First, there is the statistical level, at which we are drawing some conclusion about the model parameter. This may work even for a misspecified model.

Then there is the level at which you want to draw some conclusion about reality, call it the "scientific level". If the model is misspecified, the parameters/coefficients may or may not correspond to the thing of interest. Perhaps the model is a close enough approximation for those values to be meaningful, perhaps not...

I think it is the second ("scientific level") of inference that most people are concerned about. The rigor of the proofs/theorems that may work at the statistical level does not extend to the scientific level.

Afaict, the majority of erroneous inference occurs at the scientific level and statistical error/uncertainty is a sort of minimum error/uncertainty.

Yes, well put!

It is actually wrong. The assumption is that y is a linear combination of the covariates in X. You can run regressions like y = x + x^2 (i.e. you permit a quadratic relationship) just fine.

It's not wrong, it's just a way of looking at things that speaks to the underlying math rather than the full extent of what you can do with it if you extend it with things like kernel methods.

When you use linear regression to fit a model like

  y ~ ax + b(x^2)
what you're technically doing is fitting a linear function with two parameters on two variables. One variable happens to always be equal to the square of the other variable, but, for the purpose of how the model is usually going to be fit, it is still using the same old analytical method that's based in linear algebra.
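A small sketch of that, using made-up noiseless data generated from a = 2 and b = -0.5 (values chosen only for illustration). The columns `x1` and `x2` are treated as two ordinary variables, and the two-parameter least-squares problem is solved through its 2x2 normal equations, with no intercept, matching the model as written above.

```python
# Hypothetical data generated from y = 2*x - 0.5*x^2, with no noise.
x = [float(i) for i in range(1, 11)]
y = [2.0 * xi - 0.5 * xi ** 2 for xi in x]

# Two "variables": the second just happens to be the square of the first.
x1 = x
x2 = [xi ** 2 for xi in x]

# Entries of X'X and X'y for the model y = a*x1 + b*x2.
s11 = sum(u * u for u in x1)
s12 = sum(u * v for u, v in zip(x1, x2))
s22 = sum(v * v for v in x2)
t1 = sum(u * yi for u, yi in zip(x1, y))
t2 = sum(v * yi for v, yi in zip(x2, y))

# Solve the 2x2 normal equations with Cramer's rule.
det = s11 * s22 - s12 * s12
a = (t1 * s22 - t2 * s12) / det
b = (s11 * t2 - s12 * t1) / det

print("a:", a, "b:", b)
```

Nothing in the solve step knows or cares that `x2` is a function of `x1`. The fit is linear in `a` and `b`, which is all the linear algebra needs.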
Fair enough. Mechanically, all you're ever doing when estimating a parameter vector using OLS is projecting Y onto the span of X, and that requires linearity in the sense that Y = XB. But far too often I've met people who've come away thinking OLS is useless because they mistake linearity in parameters for 'y must be a linear function of x', which they think is too simplistic, and so they go do more complicated methods when OLS would have been just fine as long as they used polynomials and/or interaction terms.
To me, that's a stellar example of why you probably shouldn't have people who don't even have a basic undergraduate "intro to stats" understanding of the subject doing your statistics work.

I get that it's a potential cause of confusion for someone who has no training in stats. But it's also jargon that describes a useful concept, and that is literally transparent if you do have enough understanding of the math to know what "linear" and "parameters" mean in this context.

This 'assumption' always bothered me when studying for DS roles because it's something that you're expected to know if asked, but isn't really true/accurate. Another is the non-collinearity assumption between variates, which is violated all the time in ML tasks but an 'assumption' of the model nonetheless.
In general, it's helpful for me to separate assumptions that are characteristics of the generative model (and possibly an inference procedure used with it), and "assumptions" meaning things that could lead to poor out of sample prediction.
You can say a lot about what linear regression gets you when you fit a line to nonlinear data. It's a weighted average of the derivative. You don't need a linearity assumption.
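A quick way to see this, as a sketch with an assumed curve y = x^2 and evenly spaced x values (both chosen for illustration). In general the weights on the derivative are not uniform and depend on the distribution of x. In this symmetric, quadratic case the weighted average coincides with the plain average of the derivative 2x over the sample, which makes it easy to check.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def grid(lo, hi, n=101):
    """n evenly spaced points from lo to hi, inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


results = []
for lo, hi in [(0.0, 1.0), (1.0, 2.0)]:
    x = grid(lo, hi)
    y = [xi ** 2 for xi in x]                            # nonlinear "truth"
    slope = ols_slope(x, y)
    mean_derivative = sum(2.0 * xi for xi in x) / len(x)  # average of dy/dx
    results.append((slope, mean_derivative))
    print(lo, hi, slope, mean_derivative)
```

On [0, 1] the fitted slope is about 1, and on [1, 2] it is about 3, in both cases matching the average derivative over that range. The same curve gives a different line depending on where the x values sit.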