

by xcthulhu 4452 days ago

The Universal Approximation Theorem[1] says that a single hidden layer is enough to approximate any continuous function on a bounded domain as closely as you like, which at least guarantees that "an (approximate) simplification exists".

But I can't say off the top of my head how you'd collapse an ANN with just two hidden layers into one. It's not obvious how sigmoid functions compose, but I suppose I should give it more thought...

[1] http://en.wikipedia.org/wiki/Universal_approximation_theorem


michaelochurch 4452 days ago

Although one hidden layer suffices theoretically, most of the cutting-edge work involves multilayer networks. Perhaps paradoxically, those become more "auditable" from what I've heard. (I've spoken to people with real-world experience of applying ANNs to unsolved problems, but I am not one.) What I've heard is that, with one layer, you get too much overloading (i.e. different stuff ending up at the same hidden node) to understand what it is doing, but that deeper networks are more legible (with appropriate visualization tools).

UAT says that a solution exists, but it doesn't put a limit on the number of nodes required, so it would have you doing an optimization in a space that is not just large, but of arbitrary finite dimension. It can be pretty nonconstructive (in the sense of proving "there exists" without showing how to find something) insofar as it's often non-trivial to get convergence to a working solution in reasonable time.

As for how sigmoids compose, imagine how bell-shaped curves would compose, just as you can make a painting out of bell-shaped "points" if allowed arbitrary precision/steepness. Now, the difference of two sigmoids can be bell-shaped, e.g. http://www.wolframalpha.com/input/?i=plot+y+%3D+1%2F%281%2Be... . I don't know how much this means in practice, but it establishes the possibility.
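To make the "painting out of bell-shaped points" idea concrete, here is a minimal sketch in plain Python. The function names (`bump`, `approximate`) and the parameter choices (50 bumps, steepness 500) are my own assumptions for illustration, not anything from the theorem or from a library. It builds a bump as the difference of two shifted sigmoids, then approximates a function by a weighted sum of such bumps, which is exactly a network with one hidden layer of sigmoid units.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bump(x: float, center: float, width: float, steepness: float) -> float:
    """Difference of two shifted sigmoids: near 1 around `center`, near 0 far away."""
    left = sigmoid(steepness * (x - (center - width / 2)))
    right = sigmoid(steepness * (x - (center + width / 2)))
    return left - right

def approximate(f, x: float, lo: float = 0.0, hi: float = 1.0,
                n_bumps: int = 50, steepness: float = 500.0) -> float:
    """Approximate f(x) on [lo, hi] by a sum of bumps, each scaled by f at its center.

    This corresponds to one hidden layer with 2 * n_bumps sigmoid units
    and a linear output unit. The weights are set by hand, not learned.
    """
    width = (hi - lo) / n_bumps
    total = 0.0
    for i in range(n_bumps):
        center = lo + (i + 0.5) * width
        total += f(center) * bump(x, center, width, steepness)
    return total

if __name__ == "__main__":
    target = lambda t: math.sin(2 * math.pi * t)
    for x in (0.13, 0.37, 0.62, 0.88):
        print(x, target(x), approximate(target, x))
```

Narrower, steeper bumps give a finer "painting", at the cost of more hidden units, which is the unbounded-node-count issue mentioned above. The approximation degrades right at the interval edges, where only half a bump's worth of weight is present.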

agibsonccc 4452 days ago

With respect to deep learning networks (as well as more traditional ones with just a weight matrix and bias), we can look to matrices for this. I will offer a compressed representation rather than a way of pruning; I explain why below.

Each single-layer neural network is made up of three matrices: a weight matrix (the connections), a visible bias, and a hidden bias.

In theory, these can be represented together as a single flattened array.

This is what I do in deeplearning4j[1] for optimization (note: I'm the author).
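As a rough illustration of the flattening idea (this is not deeplearning4j's actual code or API; the function names and the packing order are assumptions made up for this sketch), the three pieces of a layer can be packed into one flat list and unpacked again:

```python
def flatten_params(weights, visible_bias, hidden_bias):
    """Pack the weight matrix (row-major), then visible bias, then hidden bias.

    `weights` is a list of n_visible rows, each with n_hidden entries.
    The packing order is an arbitrary choice for this sketch.
    """
    flat = [w for row in weights for w in row]
    flat.extend(visible_bias)
    flat.extend(hidden_bias)
    return flat

def unflatten_params(flat, n_visible, n_hidden):
    """Inverse of flatten_params: recover (weights, visible_bias, hidden_bias)."""
    expected = n_visible * n_hidden + n_visible + n_hidden
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    w_end = n_visible * n_hidden
    weights = [flat[r * n_hidden:(r + 1) * n_hidden] for r in range(n_visible)]
    visible_bias = flat[w_end:w_end + n_visible]
    hidden_bias = flat[w_end + n_visible:]
    return weights, visible_bias, hidden_bias

if __name__ == "__main__":
    W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # 2 visible units, 3 hidden units
    vb = [0.0, 0.1]
    hb = [1.0, 2.0, 3.0]
    flat = flatten_params(W, vb, hb)
    print(len(flat))  # 2*3 + 2 + 3 values
    print(unflatten_params(flat, 2, 3) == (W, vb, hb))
```

A single flat vector like this is convenient because general-purpose optimizers usually expect one parameter vector and one gradient vector of the same length.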

The problem with pruning neural nets, so to speak, is that this isn't really a search problem of the kind we solve with A*. Both are graphs in a sense, but under backpropagation each neuron in a neural net gets updated to correct for the error it caused, rather than being pruned as in alpha-beta pruning for game-playing AI.
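To show what "updated to correct for the error it caused" looks like, here is a minimal sketch of one gradient-descent step for a single sigmoid neuron with squared error. The function name `train_step`, the learning rate, and the toy input are assumptions for illustration; a real network repeats this per layer via the chain rule.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def train_step(weights, bias, inputs, target, lr=0.5):
    """One gradient-descent step for a single sigmoid neuron.

    Error is E = 0.5 * (out - target)^2. Returns the updated weights,
    the updated bias, and the error measured BEFORE the update.
    """
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    out = sigmoid(z)
    error = 0.5 * (out - target) ** 2
    # dE/dz: how much this neuron's pre-activation contributed to the error.
    delta = (out - target) * out * (1.0 - out)
    # Each weight moves in proportion to its own input's share of the blame.
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias, error

if __name__ == "__main__":
    w, b = [0.0, 0.0], 0.0
    for step in range(5):
        w, b, err = train_step(w, b, [1.0, 0.5], 1.0)
        print(step, err)
```

Nothing is removed from the graph here: every weight stays in place and is nudged, which is why pruning in the search sense does not map onto this directly.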

I will offer one last thing: because of the way neural nets learn (especially if you stream data in for training via online/mini-batch learning, rather than training on everything at once), each neuron also tends to learn a different component of the overall solution, and some will activate more on certain feature vectors when you train them on an overall data set.

A solution to this is to set the number of neurons relative to the input size[2].

[1] https://deeplearning4j.org/

[2] http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf