by sergiosgc 4401 days ago
> The only difference between a transformation matrix and a neural network is that a neural network has at least two layers. In other words, it is two (or more) transformation matrices bolted together. For reasons that are a bit too complex to get into here, this allows an NN to perform more complex transformations than a single matrix can. In fact, it turns out that an arbitrarily large NN can perform any polynomial-based transformation on the data.

Nice explanation. I need one clarification, though. Isn't matrix multiplication associative? Isn't thus any transformation defined by two matrices representable by a single matrix that is the product of the two matrices?

I am probably misunderstanding how NNs bolt matrices together.

3 comments

You apply a non-linear function (usually some sigmoid) on the output vector after each matrix product. Otherwise, you'd be correct and any multi-layer ANN could be expressed as a single layer network.
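To make that concrete, here is a minimal sketch in plain Python. The weights `W1` and `W2`, the helper names, and the test vectors are all made up for illustration, and it assumes a two-layer net with no bias terms. Without the sigmoid, the two layers collapse into the single matrix `W2 * W1`; with it, the network is no longer additive, so no single matrix can reproduce it.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def sigmoid(v):
    """Apply the logistic function element-wise."""
    return [1.0 / (1.0 + math.exp(-t)) for t in v]

# Hypothetical weights, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, 1.0], [-1.0, 4.0]]

def linear_net(x):
    """Two layers with no activation in between."""
    return matvec(W2, matvec(W1, x))

def collapsed(x):
    """The same map expressed as the single matrix W2 * W1."""
    return matvec(matmul(W2, W1), x)

def sigmoid_net(x):
    """Two layers with a sigmoid applied after the first matrix product."""
    return matvec(W2, sigmoid(matvec(W1, x)))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

x, y = [1.0, 2.0], [-3.0, 0.5]

# Linear case: identical outputs, by associativity.
linear_matches = all(
    abs(a - b) < 1e-9 for a, b in zip(linear_net(x), collapsed(x))
)

# Non-linear case: f(x + y) differs from f(x) + f(y), so f is not a matrix.
lhs = sigmoid_net(add(x, y))
rhs = add(sigmoid_net(x), sigmoid_net(y))
additivity_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
```

Here `linear_matches` comes out true, while `additivity_gap` is clearly non-zero, which is the property a single matrix cannot have.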
Thanks. It makes sense. The sigmoid is the activation function of the output "neuron". Unfortunately, matrix algebra here is not as useful as in computer graphics.
No problem. Actually, I personally found that a pretty intuitive understanding of linear algebra & vector calculus makes quite a lot of ML straightforward to approach geometrically.
Well, I suspect some kind of transformation could be used to turn a two-level NN into a one-level one. The thing is, the resulting one-level network might be more complex and less useful than the original two-level network. Still, I think this does illustrate the limitations of multilevel networks.

Another way to see this is to notice that NNs and SVMs[1] are (approximately or exactly) equivalent [2] because they both involve the fairly simple linear and non-linear transformations we've been looking at.

[1] http://en.wikipedia.org/wiki/Support_vector_machine
[2] http://www.staff.ncl.ac.uk/peter.andras/PAnpl2002.pdf

Interesting to note, though, that even for a linear network that could be represented by a single matrix, training with multiple layers can be faster and easier and can converge to better results, because of the different gradients and parameter space presented to the optimization algorithm.
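A toy sketch of the "different gradient" part, under heavy assumptions: a single scalar weight, one training example, squared error, and made-up starting values. It only shows that the same gradient-descent step moves the effective weight differently once that weight is factored into a product of two parameters. It says nothing on its own about which version converges faster or better.

```python
def step_single(w, x, t, lr):
    """One gradient-descent step on the loss (w*x - t)**2 with a single weight."""
    grad = 2.0 * (w * x - t) * x
    return w - lr * grad

def step_factored(w1, w2, x, t, lr):
    """One step on the same loss with w replaced by the product w2 * w1."""
    err = 2.0 * (w2 * w1 * x - t) * x
    # Chain rule: each factor's gradient is scaled by the other factor.
    return w1 - lr * err * w2, w2 - lr * err * w1

# Hypothetical values, chosen only for illustration.
x, t, lr = 1.0, 3.0, 0.1
w = 1.0
w1, w2 = 2.0, 0.5  # same effective weight: w2 * w1 == 1.0

w_new = step_single(w, x, t, lr)
w1_new, w2_new = step_factored(w1, w2, x, t, lr)
effective_new = w1_new * w2_new
```

Both models start out computing the same function, but after one step `w_new` and `effective_new` are no longer equal, because each factor's gradient depends on the current value of the other factor.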