Hacker News
by _gmax0 1157 days ago
The most concise and intuitive line of explanation I've been given goes along the lines of this:

1 - We want to model data, representative of some system, through functions.

2 - Virtually any function can be expressed by an n-th order polynomial.

3 - We wish to learn the parameters, the coefficients, of such polynomials.

4 - Neural networks allow us to brute-force test candidate values of such parameters (finding optimal candidate parameters such that the error between the expected and actual values of our dataset is minimized).

Whereas prior methods (e.g. PCA) could only model linear relationships, neural networks allowed us to begin modeling non-linear ones.

6 comments

You don't need neural networks to do polynomial regression. Polynomial regression, perhaps surprisingly, can be implemented using only (multivariable) linear regression. You just include powers of your predictor x as terms in the regression formula:

  y = a + bx + cx^2 + dx^3 + ...

The resulting model is linear in the coefficients, even though there are powers of x in your formula, because x and y are known from the data. They're not what you're solving for; you're solving for the unknown coefficients (a, b, c, d, ...). This gives you a linear system of equations in those unknown coefficients, which can be solved using standard linear least squares methods.
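To make that concrete, here is a minimal sketch in Python with NumPy. The data and the "true" coefficients are made up for illustration (a noiseless cubic, so the result is easy to check); the point is only that the fit is an ordinary linear least squares solve over a matrix whose columns are powers of x.

```python
import numpy as np

# Hypothetical data: noiseless samples of a known cubic, so the fit is checkable.
true_coeffs = np.array([1.0, -2.0, 0.5, 3.0])  # a, b, c, d
x = np.linspace(-1.0, 1.0, 20)
y = (true_coeffs[0]
     + true_coeffs[1] * x
     + true_coeffs[2] * x**2
     + true_coeffs[3] * x**3)

# Design matrix with columns [1, x, x^2, x^3].
# The powers of x are just known numbers; the unknowns (a, b, c, d) enter linearly.
X = np.vander(x, N=4, increasing=True)

# Standard linear least squares solve for the coefficients.
fitted_coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(fitted_coeffs)
```

With noisy real data the recovered coefficients would only approximate the underlying ones, and high polynomial degrees tend to make the design matrix badly conditioned.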

So fitting polynomials is easy. The problem is that it's not that useful. Deep learning has to solve much harder problems to get to a useful model.

Hm, I don't think that's quite it. I went through my own process of learning how neural networks work recently and wrote this based on my learning: https://sebinsua.com/bridging-the-gap

As far as my understanding goes, you can represent practically any function as layers of linear transformations followed by non-linear functions (e.g. `ReLU(x) = max(0, x)`). It's this sprinkling of non-linearity that allows the networks to model complex functions.
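A small sketch of that idea, assuming NumPy: one linear layer, a ReLU, and a second linear layer are enough to represent the non-linear function |x| exactly, since |x| = ReLU(x) + ReLU(-x). The weights below are picked by hand for illustration, not learned, and `tiny_net` is a made-up name.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked (not learned) weights for a network with one hidden layer.
W1 = np.array([[1.0], [-1.0]])  # hidden layer: 2 units, 1 input
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])     # output layer: 1 unit, 2 hidden inputs
b2 = np.zeros(1)

def tiny_net(x):
    """Linear -> ReLU -> linear; computes relu(x) + relu(-x), which is |x|."""
    hidden = relu(W1 @ np.atleast_1d(x) + b1)
    return (W2 @ hidden + b2)[0]

xs = np.linspace(-2.0, 2.0, 9)
outputs = np.array([tiny_net(x) for x in xs])

print(outputs)
```

Without the ReLU the two linear layers would collapse into a single linear map, and the kink at zero could not be represented.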

However, from my perspective, the secret sauce is (1) composability and (2) differentiability. These enable the backpropagation process (which is just "the chain rule" from calculus) and this is what allows these massive mathematical expressions to learn parameters (weights and biases) that perform well.
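As a rough illustration of "backpropagation is just the chain rule", here is a sketch in plain Python for a made-up two-parameter model, `w2 * ReLU(w1 * x)`, with a squared-error loss. The gradients are derived by hand, one chain-rule factor at a time, and then checked against finite differences. All names and values are hypothetical.

```python
def forward(w1, w2, x):
    h = max(0.0, w1 * x)  # ReLU(w1 * x)
    return w2 * h

def loss(w1, w2, x, target):
    return (forward(w1, w2, x) - target) ** 2

def grads(w1, w2, x, target):
    """Backpropagation by hand: walk the chain rule from the loss backwards."""
    pre = w1 * x
    h = max(0.0, pre)
    y = w2 * h
    dloss_dy = 2.0 * (y - target)     # d(loss)/dy
    dloss_dw2 = dloss_dy * h          # y = w2 * h
    dloss_dh = dloss_dy * w2
    dh_dpre = 1.0 if pre > 0 else 0.0  # derivative of ReLU
    dloss_dw1 = dloss_dh * dh_dpre * x  # pre = w1 * x
    return dloss_dw1, dloss_dw2

# Hypothetical parameter values and a single training example.
w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
g1, g2 = grads(w1, w2, x, target)

# Central finite differences as an independent check on the analytic gradients.
eps = 1e-6
num_g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
num_g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)

print(g1, num_g1)
print(g2, num_g2)
```

Real frameworks automate exactly this bookkeeping for every operation in the expression, which is why each piece has to be composable and differentiable.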

Mentioning polynomials is a pretty poor way to explain it for two reasons:

- It requires some mathematical understanding so will exclude some part of the non-technical audience

- It is the incorrect analogy. Non-linearities in neural networks have nothing to do with polynomials. In fact, polynomial regression is a type of linear regression, and for the most part, it sucks.

Also, as someone mentioned, all the “serious” alternative ML methods prior to the deep learning revolution allow modeling non-linearities (even if just through modification of linear regressions, like polynomial regression).

Thanks for the correction. It's been some time since I actively thought about the theory (evidently I didn't digest it correctly the first time!).

> Virtually any function can be expressed by an n-th order polynomial.

But there are many things that are not functions. Like circles. And they tend to crop up a lot in the real world, no pun intended.

Well, technically a circle can't be said to be a function but not for the reason you mean. A circle is a collection or a set of points, for example in a 2d plane, that are equidistant from a center point.

Probably what you are trying to say is that "a circle is not the image of a function", but that is also not true. You're assuming that, since in Cartesian coordinates you can solve for y = +/- sqrt(R^2 - x^2), the fact that y is multi-valued means it's not a function. This is what they teach in high school pre-calculus anyway.

But for example, we can associate the points on a circle with the image of the function e^{i theta}. Or equivalently, with the R^2-valued function f(theta) = (cos(theta), sin(theta)).
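A quick numerical check of that parametrization, assuming NumPy: every point produced by f(theta) = (cos(theta), sin(theta)) sits at distance 1 from the origin, and the same points come out of e^{i theta} as real and imaginary parts.

```python
import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 100)

# f(theta) = (cos(theta), sin(theta)): an R^2-valued function of one variable.
points = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# The same points viewed as complex numbers e^{i theta}.
z = np.exp(1j * theta)

# Distance of each point from the origin; should be 1 up to rounding.
radii = np.linalg.norm(points, axis=1)

print(radii.min(), radii.max())
```

Each theta maps to exactly one point, so this is a perfectly ordinary single-valued function whose image is the unit circle.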

> Whereas prior methods (e.g. PCA) could only model linear relationships,

Prior methods also allowed modelling of non-linear relationships, e.g. random forests.

Except gradient descent is about as far from brute force as it gets.

Sure, under the assumption that your parameter space is convex.
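To illustrate the convexity caveat, here is a small sketch in plain Python. The function `f` is a made-up non-convex example with two local minima; plain gradient descent follows the slope from wherever it starts, so two different starting points settle in two different minima, and only one of them is the global one.

```python
def f(w):
    # Hypothetical non-convex objective with two local minima.
    return w**4 - 3.0 * w**2 + w

def df(w):
    # Derivative of f.
    return 4.0 * w**3 - 6.0 * w + 1.0

def gradient_descent(w, lr=0.01, steps=500):
    """Plain gradient descent: repeatedly step against the derivative."""
    for _ in range(steps):
        w -= lr * df(w)
    return w

w_from_right = gradient_descent(2.0)   # starts to the right of both minima
w_from_left = gradient_descent(-2.0)   # starts to the left of both minima

print(w_from_right, f(w_from_right))
print(w_from_left, f(w_from_left))
```

Both runs stop where the derivative is essentially zero, but the run started on the right ends at a higher loss than the run started on the left. On a convex objective the two runs would agree.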