Hacker News
by EE84M3i 2198 days ago
Serious (and likely ignorant) question - what does linearity have to do with anything here? linear over what and why does non-linearity make something 'unpredictable'?
3 comments

Linear models have more bias, so they represent current data less well and are more predictive of future, unseen data (think of a straight line through a point cloud).

Non-linear models have more variance so they represent current data better and are less predictive of future, unseen data (think of a line snaking around a point cloud).

An added complication is that deep neural net models are, in practice, vectors (or, well, tensors) of numbers so they are difficult to interpret. This and their extreme variance make it hard to know how they will behave in the future.

The bias/variance trade-off is not really related to extrapolation. Think of a point cloud following a quadratic shape. A linear model will extrapolate terribly.
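To make the quadratic point cloud concrete, here is a minimal sketch with NumPy. The data range, noise level, seed and the far-away test point x = 10 are all assumptions chosen for illustration; nothing here comes from a real dataset.

```python
import numpy as np

# Assumed toy data: a noisy quadratic sampled only on [0, 3].
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 3.0, 50)
y_train = x_train**2 + rng.normal(0.0, 0.1, x_train.size)

# Least-squares fits: a straight line and a quadratic.
line = np.polyfit(x_train, y_train, deg=1)
quad = np.polyfit(x_train, y_train, deg=2)

# Extrapolate far outside the training range and compare with the true curve.
x_far = 10.0
true_y = x_far**2
line_err = abs(np.polyval(line, x_far) - true_y)
quad_err = abs(np.polyval(quad, x_far) - true_y)

print(f"line error at x={x_far}: {line_err:.2f}")
print(f"quadratic error at x={x_far}: {quad_err:.2f}")
```

In this setup the straight line is the higher-bias model, yet it is the one that misses badly away from the data, because its shape does not match the shape of the underlying curve.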
Well, "more predictive" doesn't mean it's a perfect fit. Every model has error. A line through a point cloud curving upwards will still represent some of the points in the cloud. So it will have high error, but it's still a representation of the data.

And yes, the bias-variance tradeoff is about generalisation (i.e. the ability to extrapolate to unseen data). But this is more related to the fact that in the real world, problem spaces don't have nice, friendly, regular shapes nor do their shapes stay put after we've trained a model.

My understanding is that generally, the error when extrapolating to areas not covered by the training data distribution would be considered to be part of the "bias" part of the bias-variance tradeoff.

The way I see it, the variance is the part of the error that you can reduce by collecting more data from your distribution and increasing model complexity if needed.

The bias part is what will not get better no matter how much you sample your distribution, and extrapolation problems fall into that category.

>> The way I see it, the variance is the part of the error that you can reduce by collecting more data from your distribution and increasing model complexity if needed.

Ah, apologies, I see what you mean. That is true, but this "error" is in-sample error, so increasing your model's variance will increase its ability to interpolate but not extrapolate to out-of-sample data, as I explain in my longer comment.

"In-sample" means all the data you've collected to train and test with. It includes training/validation/test splits. At the end of k-fold cross-validation, your model has "seen" all the data in your sample and the model that performs best is the model that best represents that data.

But, because the data was sampled from a distribution that is most likely not the true distribution of the data (since that distribution is unknown), the sampling error (i.e. the differences between the true and sample distributions) will be reflected in the model. A high-variance model will suffer more from this than a high-bias one.

Sorry I didn't understand immediately what you meant. The longer comment above is correct but probably doesn't help answer your question directly.

Thanks for taking the time to write the detailed responses. Definitely led me to think more closely about these vaguely held intuitions about bias and variance! I think you are exactly right that the crucial aspect is the variance when looking at out-of-sample predictions, not just across several samplings from the original training distribution (a la k-fold cross-validation).
Bias and variance are characteristics of the model, not components of its error as I think you're saying. In the most simple sense, bias and variance refer to the shape of the function represented by the model (let's say "the shape of the model" for simplicity). A model with a more "rigid" shape (approaching a straight line) has more bias and one with a more "relaxed" shape (further from a straight line) has more variance.

The extent to which a model can extrapolate to out-of-sample data depends on how well the shape of the model follows the true distribution of the data. This is true regardless of the bias and variance of the model. It just happens that most of the time, in interesting, real-world problems, the true distribution of the data is more or less different from the sampling distribution of the training data, i.e. there's always some amount of "sampling error".

Sampling error can't be reduced by collecting more training data; you just have more data with the same sampling error. Increasing model complexity increases variance, so if you start with high sampling error, you will get a high error on out-of-sample data because your model matches the "off" distribution of the training data too closely. What training with more data and with a more complex model can do is increase the ability of the trained model to interpolate, i.e. to accurately represent (new) data points that are in the same region of "instance space" as the training data points.

A high-bias model can extrapolate well if the sampling error is not too high and the shape of the true distribution is not too irregular. However, a high-bias model will also not interpolate as well as a high-variance model. Its rigid structure will "miss" many data points. Like you say, this will not change if you train with more data. Anyway, that's the tradeoff.

Now, the reason why deep neural nets, which are extremely high-variance models, are trained with large amounts of data, is that they can interpolate very well but can't extrapolate very well. If a model doesn't extrapolate very well but its training sample is a large enough chunk of instance space, it can still be very useful, because it's still representing a large number of instances.

How to put it? Maybe your high-variance model has seen examples of white dogs and black dogs in training, but no green dogs. Your model will not be able to generalise to green dogs, but if green dogs are rare, it will still be able to represent most dogs, so it's still useful.

Of course, looking at the output of a trained model (its behaviour) doesn't tell you anything about what it was trained on. So a model that has very high accuracy on a large number of tasks will look impressive, even if it can't generalise at all.

I'm not good at math, but I'm confused by the association of AI with non-linear stuff, setting aside the association of non-linear with "bad". I thought ML involved linear algebra or something (says xkcd!) which would presumably be...linear?
The inner activation function (AF) of neurons is inherently nonlinear; it has to be in order to solve any problem that is not linearly decomposable (which is basically all of the interesting problems). Often the AF nonlinearity shows up as a thresholding operation following a linear weighted sum, but that's not the only mechanism.

And yet neurons are not "pure" binary thresholders the way logic gates are because you can't take the derivative of a binary function, and you can only do backpropagation on differentiable functions. The compromise neurons make is a "smoothed threshold" or sigmoidal curve which is differentiable but still very nonlinear.
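A small sketch of the "smoothed threshold" point, assuming the logistic sigmoid as the smooth activation (one common choice, not the only one) and a handful of arbitrary input values. The hard step has a zero slope everywhere away from the jump, so it gives gradient-based training nothing to follow, while the sigmoid has a usable slope at every input.

```python
import numpy as np

def step(z):
    """Hard binary threshold: 1 where z > 0, else 0."""
    return (z > 0).astype(float)

def sigmoid(z):
    """Logistic sigmoid, a smooth stand-in for the hard threshold."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Arbitrary sample inputs, chosen to stay away from the jump at z = 0.
z = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-6

# Central finite differences as a numerical check on the slopes.
step_grad = (step(z + eps) - step(z - eps)) / (2 * eps)
numeric_sigmoid_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print("step slope:   ", step_grad)
print("sigmoid slope:", sigmoid_grad(z))
```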

I'm not sure where the "linear" in "linear algebra" comes from. You hear about linear algebra in relation with machine learning a lot because training a neural net (with the backpropagation algorithm and friends) requires some matrix arithmetic. Inputs to neural nets are vectors or matrices, their weights are (arrayed in) vectors or matrices, their outputs are - well, usually scalars but can also be vectors or matrices.

Also, the use of linear/nonlinear in machine learning is a bit misleading. A "line" is not necessarily a "straight line", but usually when we say "linear" we mean "straight" and so when we want to say "not straight" we use "nonlinear".

In any case, when we say "line" in machine learning we mean a function, the function of a line. So a "nonlinear" function is a function that curves and turns, e.g. a sigmoid, whereas a "linear" function is straight as a rod.

Why a line? Classifiers, er, classify by drawing a line through space. "Space" means a Cartesian space where our training examples are represented as points (hence, "data points"). Data points are located in Cartesian space according to coordinates that represent their attributes, or features (these coordinates are the "feature vectors" that are input to neural nets). We classify data points by drawing a line between those that belong to one class and those that belong to other classes. More to the point, when we train a classifier, we find the parameters of a function of a line that separates the points of separate classes and when we want to classify a new point, we look at where it falls in relation to that line.

So that's where all that stuff about lines and "linear" and "nonlinear" models comes from. A "linear model" or "linear classifier" can only draw straight lines. A "nonlinear model" can go twirling around madly.
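As a rough illustration of "can only draw straight lines", here is a sketch using the four XOR points, a textbook case that no single straight line separates. The brute-force grid of candidate lines and the hand-picked weights for the curved version are assumptions made for the example, not a training procedure.

```python
import itertools
import numpy as np

# The four XOR points: class 1 when exactly one coordinate is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def n_correct(w, b, feats):
    """Classify by which side of the boundary feats @ w + b = 0 a point falls on."""
    pred = (feats @ w + b > 0).astype(int)
    return int((pred == y).sum())

# Try a coarse grid of straight lines w1*x1 + w2*x2 + b = 0.
grid = np.linspace(-2.0, 2.0, 9)
best_linear = max(
    n_correct(np.array([w1, w2]), b, X)
    for w1, w2, b in itertools.product(grid, repeat=3)
)

# Add one nonlinear feature, x1*x2, so the boundary can curve in the original space.
X_curved = np.column_stack([X, X[:, 0] * X[:, 1]])
curved_correct = n_correct(np.array([1.0, 1.0, -2.0]), -0.5, X_curved)

print("best straight line gets", best_linear, "of 4 right")
print("curved boundary gets", curved_correct, "of 4 right")
```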

Finally, "non-linear" doesn't mean "bad". There are tradeoffs, in particular the "bias-variance tradeoff" that I hint at in my earlier comment. A linear model is more limited in what it can represent, but a nonlinear model is less likely to represent data that it hasn't seen in training.

- "linear" in "linear algebra" comes from "system of linear equations"

- NNs can absolutely represent non-linear functions, and they are based on solving systems of linear equations.

- The non-linear function here has nothing to do with the linearity of the system of linear equations used to construct it.

- The two main sources of non-linearity are (a) the inputs (e.g., an image, or a series of images varying in a non-linear fashion), and (b) the activation functions.

The underlying derivatives are linear (like all derivatives) but neural networks' ability to approximate arbitrary non-linear functions is one of their biggest strengths.
Yes, so I'm left wondering, when making the association of the math to the badness, how do you decide if the linearity or the non-linearity is the salient part?
Mathematically, you can think of "linear" AI problems as "easy to solve", and non-linear as "difficult". That's part of what the parent means.

Some function being linear means it's easier to guess. If a real world phenomenon is tied to a linear function, then it's easy for AI to guess/approximate.

If you have ever opened up Excel or a similar program, one of the more useful options is to generate a regression line-fit on your data points.

One option is to specify a polynomial function, where you can choose how many coefficients you want. One of the measurements is the mean squared error between the line-fit and the points.

You can add as many polynomial coefficients as you want, and you will be able to decrease the mean squared error. But the more polynomial coefficients you choose, the more two things will be true:

1. The line-fit will be far more likely to go through the points.

2. At points in the line where there was no data, the line will approximate the underlying physical reality less well.

That same mathematical property is what is relevant here. There is nothing inherently evil about non-linearity, when the non-linear math model properly maps to the physical reality. But when you overfit a line, many of the functional solutions may be completely wrong.
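The Excel experiment can be sketched in a few lines of NumPy. Everything specific here is an assumption for illustration: the "physical reality" is taken to be the straight line y = 2x + 1, there are ten noisy samples, and the degrees compared are 1, 3 and 9.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def truth(x):
    """Assumed underlying reality: a plain straight line."""
    return 2.0 * x + 1.0

# Ten noisy measurements on [0, 1].
x_train = np.linspace(0.0, 1.0, 10)
y_train = truth(x_train) + rng.normal(0.0, 0.2, x_train.size)

# Dense grid covering the gaps between samples and a little beyond both ends.
x_grid = np.linspace(-0.1, 1.1, 241)

train_mse = {}
grid_mse = {}
for degree in (1, 3, 9):
    fit = Polynomial.fit(x_train, y_train, degree)
    # Error against the measured points (what the spreadsheet reports).
    train_mse[degree] = float(np.mean((fit(x_train) - y_train) ** 2))
    # Error against the true line where there was no data.
    grid_mse[degree] = float(np.mean((fit(x_grid) - truth(x_grid)) ** 2))

for degree in train_mse:
    print(degree, train_mse[degree], grid_mse[degree])
```

The first error column shrinks as the degree goes up, which is point 1; the second column is the one that tells you about point 2.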

I'm confused. I agree that overfitting can lead to very bad models.

But what I don't understand is that I thought that "linear" in ML contexts was normally used in the sense of 'linear transformations' [1], which is a sense of linear that 'line-fit' from Excel isn't -- it's affine.
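For anyone unsure about the linear/affine distinction, a quick numerical check, with an arbitrary made-up matrix `W` and offset `b`: a linear map satisfies f(u + v) = f(u) + f(v), and adding a constant offset (the intercept of a line-fit) breaks that.

```python
import numpy as np

# Arbitrary example weights and offset, chosen only for illustration.
W = np.array([[2.0, -1.0],
              [0.5, 3.0]])
b = np.array([1.0, -2.0])

def linear_map(x):
    """Linear in the linear-algebra sense: no offset."""
    return W @ x

def affine_map(x):
    """Linear part plus a constant offset, like a fitted line with an intercept."""
    return W @ x + b

u = np.array([1.0, 2.0])
v = np.array([-3.0, 0.5])

# Additivity holds for the linear map but fails for the affine one,
# because the offset b gets counted twice on the right-hand side.
linear_is_additive = np.allclose(linear_map(u + v), linear_map(u) + linear_map(v))
affine_is_additive = np.allclose(affine_map(u + v), affine_map(u) + affine_map(v))

print(linear_is_additive, affine_is_additive)
```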

Is a linear model with thousands/millions of weights/parameters (like deep learning models) really substantially simpler to understand? Can it do anything useful?
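One fact that bears on this question: if every layer is purely linear (no activation function and, in this sketch, no bias terms), the whole stack collapses into a single matrix, however many weights it started with. The layer sizes and random weights below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three hypothetical linear layers: 4 -> 8 -> 8 -> 2, no activations.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(2, 8))

def deep_linear(x):
    """Apply the three layers one after another."""
    return W3 @ (W2 @ (W1 @ x))

# The same function as one 2 x 4 matrix.
W_single = W3 @ W2 @ W1

x = rng.normal(size=4)
same_output = np.allclose(deep_linear(x), W_single @ x)

params_deep = W1.size + W2.size + W3.size
params_single = W_single.size

print(same_output, params_deep, params_single)
```

So the parameter count of a purely linear stack overstates what it can express; the extra expressive power of deep nets comes from the nonlinearities between layers.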

[1]: https://en.wikipedia.org/wiki/Linear_map

I suppose from the perspective of someone implementing these models, yeah - it is linear, but it is not bijective. In a system with only one layer, that manifests as an alias (assuming the output dimensions are smaller). In a system with multiple layers of either `N->M` or `M->N`, those aliases tend to manifest as apparent "non-linearities".

So, I guess looking from the bottom up the system may look non-continuous and linear. But if you look from the top down, it would look continuous and non-linear.

Really, I am not sure which one is "true".

I assume they are using non-linear to mean non-continuous, which implies that there can be large, hard-to-understand changes in behavior when the input is changed only a small amount.
Polynomials with large degrees are continuous. It's just that they can still change by a large amount (i.e. having a large derivative) when the input is changed by a small amount.

I invite you to construct the Lagrange polynomial (i.e. interpolating polynomial) for points on a nice, simple curve with some noise. It will, by definition, pass through every point given, and yet it will likely behave very badly outside the range of the given points.
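Taking up that invitation, here is a minimal sketch. The "nice, simple curve" is assumed to be y = x, sampled at ten equally spaced points with a little Gaussian noise; the noise level, seed and the outside test point x = 1.5 are arbitrary choices.

```python
import numpy as np

def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i in range(len(xs)):
        # Basis polynomial i: equals 1 at xs[i] and 0 at every other node.
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total

rng = np.random.default_rng(2)
xs = np.linspace(0.0, 1.0, 10)
ys = xs + rng.normal(0.0, 0.05, xs.size)  # noisy samples of y = x

# The interpolant reproduces every given point.
at_nodes = np.array([lagrange_eval(xs, ys, x) for x in xs])

# Outside the sampled range, compare against the true curve y = x.
outside_x = 1.5
outside_error = abs(lagrange_eval(xs, ys, outside_x) - outside_x)

print("max deviation at the given points:", np.max(np.abs(at_nodes - ys)))
print("error at x = 1.5:", outside_error)
```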

There is nothing wrong with using a non-linear model, though; x^2 or x^3 regressions make sense on many datasets.

Non-continuous is also not the perfect terminology, but I argue that it is more precise than non-linear: the chief idea being that the model "changes unpredictably."

Sure, you can argue things however you want, if you also decide to ignore hundreds of years of mathematical terminology.