by unishark
2123 days ago
Gradient descent is already about as easy a training method as there can be. With just a little freshman calculus, programmers can do the "state of the art" optimization of modern times. It's also scalable. If your polynomial regression gets too large because of model complexity (for comparison, typical deep networks can have millions of parameters), you can't invert your matrix and will probably end up using a similar method anyway. I would have thought a computer uses tables to compute e^x. There are also piecewise linear activation functions whose gradients are trivially easy to compute. The whole "universal approximation" perspective is pretty vague to begin with. I'd say people generally don't understand why NNs work as well as they do. Previously, theorists expected they would need a lot more training data to work, given their complexity. So the field is driven to a large degree by empirical success. I am certainly really interested to see people accomplishing the same things with less sophisticated methods, since there is no doubt NNs have been overused/hyped in some areas just to make the papers and proposals sexier.
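To make the "freshman calculus" point concrete, here is a minimal sketch (not anyone's production code) that fits a small polynomial regression two ways: plain gradient descent on mean squared error, and a direct least-squares solve. The function names `fit_poly_gd` and `fit_poly_exact`, the learning rate, the step count, and the synthetic data are all assumptions made up for illustration.

```python
import numpy as np

def fit_poly_gd(x, y, degree=3, lr=0.1, steps=10000):
    """Fit polynomial coefficients by plain gradient descent on mean squared error."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    w = np.zeros(degree + 1)
    n = len(y)
    for _ in range(steps):
        residual = X @ w - y
        grad = (2.0 / n) * (X.T @ residual)  # d/dw of mean((Xw - y)^2)
        w -= lr * grad
    return w

def fit_poly_exact(x, y, degree=3):
    """Fit the same model with a direct least-squares solve for comparison."""
    X = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Hypothetical toy data: a cubic with a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.01 * rng.normal(size=200)

w_gd = fit_poly_gd(x, y)
w_exact = fit_poly_exact(x, y)
print("gradient descent:", w_gd)
print("direct solve:    ", w_exact)
```

On a problem this small the direct solve is the obvious choice; the iterative version only starts to pay off once the design matrix is too large to factor, which is the situation the comment describes.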
|
Multiple times this. This claim gets trotted out frequently to showcase the superiority of NNs.
At best this is a red herring; at worst it is dishonest. The problem is that NNs aren't the only universal approximators. There is a whole slew of them: nearest neighbor approximators, polynomials, rational splines, kernel methods … Furthermore, the universal approximation property holds only under certain conditions.
Finally, the ability to represent a function arbitrarily well (approximation property) does not mean that one will be able to find the representation from data easily (learning property). Empirical evidence suggests that, among the classes of universal approximators we know, NNs seem easy to train effectively. Why this is so is not well understood.
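As a rough illustration of the "NNs aren't the only universal approximators" point, the sketch below uses a 1-nearest-neighbor lookup to approximate sin(x) on [0, 2π] and shows the worst-case error shrinking as the number of stored samples grows. The helper names `nn_predict` and `max_error`, the target function, and the sample counts are assumptions chosen for the example. It only demonstrates the approximation property on noise-free samples and says nothing about how easily such a model learns from real data.

```python
import numpy as np

def nn_predict(x_train, y_train, x_query):
    """Return the stored y value of the closest training point for each query."""
    nearest = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

def max_error(n_train, n_query=2001):
    """Worst-case error of 1-nearest-neighbor on sin(x) over [0, 2*pi]."""
    x_train = np.linspace(0.0, 2.0 * np.pi, n_train)
    y_train = np.sin(x_train)
    x_query = np.linspace(0.0, 2.0 * np.pi, n_query)
    return np.abs(nn_predict(x_train, y_train, x_query) - np.sin(x_query)).max()

errors = [max_error(n) for n in (10, 100, 1000)]
for n, err in zip((10, 100, 1000), errors):
    print(f"{n:5d} samples -> max error {err:.4f}")
```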