by tibbar 408 days ago
Some important context missing from this post (IMO) is that the data set presented is probably not a very good fit for linear regression, or really most classical models: You can see that there's way more variance at one end of the dataset. So even if we find the best model for the data that looks great in our gradient-descent-like visualization, it might not have that much predictive power. One common trick to deal with data sets like this is to map the data to another space where the distribution is more even and then build a model in that space. Then you can make predictions for the original data set by taking the inverse mapping on the outputs of the model.
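A minimal sketch of the transform-then-invert idea described above, assuming a log mapping and made-up synthetic data whose spread grows with x (the data, the `fit_line` helper, and the `predict` function are illustrative and not from the original post):

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

random.seed(0)
xs = [i / 10 for i in range(1, 101)]
# Hypothetical data: multiplicative noise, so the variance grows with x.
ys = [math.exp(0.5 + 0.3 * x + random.gauss(0, 0.2)) for x in xs]

# Fit in log space, where the noise is roughly constant.
a, b = fit_line(xs, [math.log(y) for y in ys])

def predict(x):
    """Predict on the original scale by inverting the log mapping."""
    return math.exp(a + b * x)

print(a, b, predict(5.0))
```

One caveat with this sketch: exponentiating a prediction made in log space estimates something closer to the conditional median than the conditional mean, so a correction may be needed if the mean is what you care about.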
3 comments

Non-constant variance does not actually bias the coefficients of a linear regression model -- thus, its predictions will be just fine. What it does is invalidate the usual standard errors, which are often underestimated, so your p-values will typically be too small. Sometimes a log-transform or similar can help; otherwise you can use weighted least squares.

This kind of problem is actually a good intro to iterative refitting methods for regression models: How do you know what the weights should be? Well, you fit the initial model with no weights, get its residuals, use those to fit another model, rinse and repeat until convergence. A good learning experience and easy to hand-code.
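Here is one way the refitting loop above could be hand-coded. It is a rough sketch, not a reference implementation: the synthetic data, the `fit_wls` helper, and the choice to model log squared residuals as a linear function of log(x) are all assumptions made for illustration.

```python
import math
import random

def fit_wls(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

random.seed(1)
xs = [i / 10 for i in range(1, 101)]
# Hypothetical data: noise standard deviation proportional to x.
ys = [1.0 + 2.0 * x + random.gauss(0, 0.5 * x) for x in xs]

ones = [1.0] * len(xs)
log_x = [math.log(x) for x in xs]

# Step 1: initial fit with no weights.
weights = ones
a, b = fit_wls(xs, ys, weights)

for _ in range(20):
    # Step 2: use the residuals to fit a second model for the variance.
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    log_r2 = [math.log(r * r + 1e-12) for r in resid]  # guard against log(0)
    c, d = fit_wls(log_x, log_r2, ones)
    # Step 3: weight each point by the inverse of its estimated variance.
    weights = [1.0 / math.exp(c + d * lx) for lx in log_x]
    a_new, b_new = fit_wls(xs, ys, weights)
    converged = abs(a_new - a) + abs(b_new - b) < 1e-8
    a, b = a_new, b_new
    if converged:
        break

print(a, b)
```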

In my work, I hardly ever use simple linear regression, but I do use multiple linear regression. Multiple linear regression allows multiple linear predictors, and the method partitions the shared and independent variance associated with each predictor. These discussions of linear regression hardly ever touch on the very useful multiple linear regression method. When multi-collinear predictors cause bad variance inflation, regularized regression techniques like ridge, LASSO, or elastic net are advised.
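To make the collinearity point concrete, here is a small sketch of ridge regression with two nearly identical predictors. The data and the `ridge_2d` helper are hypothetical. Setting `lam=0` gives ordinary least squares, so the two fits can be compared directly.

```python
import random

def center(v):
    """Subtract the mean so no intercept term is needed."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def ridge_2d(x1, x2, y, lam):
    """Ridge for two centered predictors: solves (X'X + lam*I) w = X'y."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

random.seed(2)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
# x2 is almost a copy of x1, so the two predictors are highly collinear.
x2 = [a + random.gauss(0, 0.01) for a in x1]
y = [a + b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

x1c, x2c, yc = center(x1), center(x2), center(y)
w_ols = ridge_2d(x1c, x2c, yc, lam=0.0)
w_ridge = ridge_2d(x1c, x2c, yc, lam=1.0)

print("OLS coefficients:  ", w_ols)
print("ridge coefficients:", w_ridge)
```

In a run like this the individual OLS coefficients can land far from the true values of 1 and 1, even though their sum is well determined; the ridge penalty shrinks the poorly determined difference between them.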

In relation to gradient descent, I do not know enough to say whether multiple regression is relevant, or why it would not be.

And yeah, for non-normal error distributions, we should be looking at generalized linear models, which allow one to specify other distributions that might better fit the data.

What you’re describing is the technique known as the “kernel trick”, correct?
No, the kernel trick is something else: basically a nonlinear basis representation of the model. For example, fitting a polynomial model, or using splines, would effectively be using the "kernel trick" (though only ML people use that term, not statisticians, and usually they talk about it in the context of SVMs but it's fine for linear regression too). Transforming the data is just transforming the Y-outcome, most commonly with log(y) for things that tend to be distributed with a right-skew: house prices being a classic example, along with things like income, various blood biomarkers, or really anything that cannot go below zero but can (in principle) be arbitrarily large.

In a few rare cases I have found situations where sqrt(y) or 1/y is a clever and useful transform but they're very situational, often occurring when there's some physical law behind the data generation process with that sort of mathematical form.

To be fair, the "trick" part of the kernel trick involves implicitly transforming the data into a higher dimensional space and then fitting a linear function in that space. I.e., you're transforming the inputs so that a linear function from inputs to outputs fits better than if you didn't do the transform.

The "trick" allows you to fit a linear function in that higher dimensional space without any potentially costly explicit computation in the higher dimensional space based on the observation that the optimal solution's parameters can be represented as a sum of the higher dimensional representations of points in the training set.
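A small sketch of that observation, using ridge regression with a degree-2 polynomial kernel as an assumed example (the data, `lam`, and all helper names are made up for illustration). The primal fit works with the explicit features `phi(x)`; the dual fit only ever evaluates `kernel(x, z)`, yet both give the same predictions.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def phi(x):
    """Explicit feature map whose inner product equals kernel(x, z)."""
    return [1.0, math.sqrt(2.0) * x, x * x]

def kernel(x, z):
    """Degree-2 polynomial kernel: phi(x) . phi(z) = (1 + x*z)**2."""
    return (1.0 + x * z) ** 2

xs = [-1.0 + 2.0 * i / 7 for i in range(8)]
ys = [math.sin(2.0 * x) for x in xs]
lam = 0.1
n = len(xs)

# Primal: solve for 3 weights in the explicit feature space.
Phi = [phi(x) for x in xs]
A = [[sum(Phi[k][i] * Phi[k][j] for k in range(n)) + (lam if i == j else 0.0)
      for j in range(3)] for i in range(3)]
rhs = [sum(Phi[k][i] * ys[k] for k in range(n)) for i in range(3)]
w = solve(A, rhs)

# Dual: one coefficient per training point, using only kernel evaluations.
K = [[kernel(xi, xj) + (lam if i == j else 0.0)
      for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
alpha = solve(K, ys)

def predict_primal(x):
    return sum(wi * f for wi, f in zip(w, phi(x)))

def predict_dual(x):
    # Equivalent to using w = sum_i alpha_i * phi(x_i) without forming phi.
    return sum(ai * kernel(xi, x) for ai, xi in zip(alpha, xs))

print(predict_primal(0.3), predict_dual(0.3))
```

With only three explicit features the dual form saves nothing here; the payoff comes with kernels whose feature space is very large or infinite-dimensional.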

No, actually, I think you’re mistaken. Representing the model via a nonlinear transformation where a linear model more closely captures what’s going on is precisely what the kernel trick does, although the situation being described is broader than the kernel trick; things like the power transform also fit the bill.
The kernel trick is a technique used in data classification that involves mapping the points into a higher dimensional space and then finding a linear separation in that higher dimension.

It's not about finding a line of best fit or making the dataset appear linear, it's about being able to split a dataset into two classes using a linear function.
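As a toy illustration of linear separation in a higher dimensional space, the sketch below uses an explicit mapping rather than a kernel function, and the points, the `lift` map, and the threshold are all made up. Two concentric rings cannot be split by a line in the plane, but after adding a third coordinate a flat plane separates them.

```python
import math

def lift(p):
    """Map a 2D point to 3D by appending its squared radius."""
    x1, x2 = p
    return (x1, x2, x1 * x1 + x2 * x2)

angles = [2 * math.pi * k / 12 for k in range(12)]
inner = [(0.5 * math.cos(t), 0.5 * math.sin(t)) for t in angles]  # class 0
outer = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in angles]  # class 1

def classify(p, threshold=2.0):
    """Linear rule in the lifted space: which side of the plane z = threshold."""
    return 1 if lift(p)[2] > threshold else 0

print([classify(p) for p in inner])
print([classify(p) for p in outer])
```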

Sure, it’s not about finding a line of best fit, but the principle is the same: a transformed space where linear things work better is used.
Just keep in mind that describing the kernel trick as a way to transform a data set so that "linear things work better" is very vague. It's passable, but it's also different from what was originally posted: the kernel trick doesn't transform your data into a space where that data becomes linear. It transforms your data into a space where it can be separated by a line/plane. The data is almost always non-linear in that transformed space, but it's transformed in a way that lets a plane cleanly separate it.

Given that the kernel trick is pretty specific jargon used mostly in a specific circumstance, it's in your interest to use that term in that specific context. If you're interested in the more general idea of transforming data so that some function fits it well, whether linear, Gaussian, or some other form, the term is "feature transformation".