Hacker News
by erostrate 3724 days ago
The swiss roll problem also illustrates nicely the idea behind deep learning.

Before deep learning people would manually design all these extra features sin(x_1), x_1^2, etc. because they thought it was necessary to fit this swiss roll dataset. So they would use a shallow network with all these features like this: http://imgur.com/H1cvt8d

Then the deep learning guys realized that you don't have to engineer all these extra features, you can just use basic features x_1, x_2 and let the network learn more complicated transformations in subsequent layers. So they would use a deep network with only x_1, x_2 as inputs: http://imgur.com/XBRjROP

Both these approaches work here (loss < 0.01). The difference is that for the first one you have to manually choose the extra features sin(x_1), x_1^2, ... for each problem. And the more complicated the problem the harder it is to design good features. People in the computer vision community spent years and years trying to design good features for e.g. object recognition. But finally some people realized that deep networks could learn these features themselves. And that's the main idea in deep learning.
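To make the two setups concrete, here is a minimal sketch in plain Python. It is not the demo's code: `make_spiral`, `raw_features` and `engineered_features` are hypothetical names, the spiral formula is an assumed stand-in for the dataset, and apart from sin(x_1) and x_1^2 (which are named above) the feature list is a guess at what "etc." covers.

```python
import math
import random

def make_spiral(n_per_class=100, turns=1.5, noise=0.0, seed=0):
    """Two interleaved spiral arms; returns a list of ((x1, x2), label)."""
    rng = random.Random(seed)
    points = []
    for label, phase in ((1, 0.0), (-1, math.pi)):
        for i in range(n_per_class):
            t = i / n_per_class                  # position along the arm, 0..1
            r = 5.0 * t                          # radius grows along the arm
            angle = turns * 2 * math.pi * t + phase
            x1 = r * math.sin(angle) + rng.uniform(-1, 1) * noise
            x2 = r * math.cos(angle) + rng.uniform(-1, 1) * noise
            points.append(((x1, x2), label))
    return points

def raw_features(x1, x2):
    """Deep-network setup: only the two coordinates go in."""
    return [x1, x2]

def engineered_features(x1, x2):
    """Shallow-network setup: hand-picked transformations (assumed list)."""
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2, math.sin(x1), math.sin(x2)]

(x1, x2), label = make_spiral(n_per_class=50)[10]
print(len(raw_features(x1, x2)), len(engineered_features(x1, x2)))
```

The point of the comparison is where the work goes: in the first setup a person picks the entries of `engineered_features` for each new problem, in the second the hidden layers are expected to learn equivalent transformations from `raw_features`.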

9 comments

I think I learned more from your post and your two imgur links than from poking at the site for an hour. Thanks.

Would it make sense for them to add a gallery of good solutions for each problem, or would they all basically be your second example network (no time to play and see for myself right now)?

>Before deep learning people would manually design all these extra features sin(x_1), x_1^2, etc.

It's probably worth pointing out that this is true for ANNs, but there were (and are) other "shallow" classifiers that can handle the swiss roll problem without manual feature engineering. SVMs, for example.

http://cs.stanford.edu/people/karpathy/svmjs/demo/

needs another image link for visualization
But how will the number of neurons N grow with the number of turns in the spiral?

If N levels off, then the network has grasped the concept of a spiral and can generalize to arbitrary size.

If N doesn't level off, then the network isn't really learning the general case.

I know this is going to sound cheesy but that's an amazing way to put it. It blew my mind.
Using their network, you are limited to 8 units per layer it seems.

So, I ported their swiss roll dataset to python and threw together a shallow network trainer with theano:

https://gist.github.com/notmatthancock/68d52af2e8cde7fbff1c9...

Then, I trained a shallow network with 36 hidden units (your deep net has 6 units and 6 layers):

http://i.imgur.com/I0pXaTK.png

edit: I forgot to mention that the shallow network above takes only the two coordinates (x1 and x2) as input features.
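For a rough sense of scale, the two architectures can be compared by parameter count. This is a back-of-the-envelope sketch, assuming fully connected layers, one bias per unit, two inputs and a single output unit; none of those details are confirmed by the gist or the demo, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected net with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

shallow = [2, 36, 1]           # 2 inputs, one hidden layer of 36, 1 output
deep = [2] + [6] * 6 + [1]     # 2 inputs, 6 hidden layers of 6, 1 output

print(count_parameters(shallow))  # 145
print(count_parameters(deep))     # 235
```

Under these assumptions both nets have 36 hidden units, but the deep one carries more parameters because of the 6-by-6 connections between consecutive hidden layers.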

Just so I understand correctly: your network has 100000 iterations, while the parent's has 1000, but they both only use x / y positions?

It feels like neurons in the first layer are weaker, because all they can do is a linear separation. Given a deep network, I was wondering whether adding neurons to the first layer is better than adding them to the last one, and empirically it feels quite a bit worse. I wonder if there is a theorem about that.

> your network has 100000 iterations, while the parent's has 1000, but they both only use x / y positions

Correct, but keep in mind that their method appears to use batch descent while mine does not. Batch descent often converges more quickly. There are other differences between my net and the GP's that I can spot as well (e.g., the activation function, the learning rate, and regularization).

Also keep in mind that I threw this together over breakfast, and did not spend much time tweaking parameters :)
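The difference between the two update schedules can be sketched on a toy problem. This is not the code from the gist or from the demo: it fits a single weight to y = 3x, reads "batch descent" as full-batch gradient descent (an assumption), and uses hypothetical names `batch_step` and `sgd_epoch`. It makes no claim about which schedule is faster in general.

```python
import random

# Toy data from y = 3x with no noise, so the best weight is exactly 3.
data = [(x, 3.0 * x) for x in (-2.0, -1.0, 1.0, 2.0)]

def batch_step(w, data, lr):
    """One full-batch update: average the gradient over every sample first."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def sgd_epoch(w, data, lr, rng):
    """One pass of per-sample updates: the weight moves after each sample."""
    order = data[:]
    rng.shuffle(order)
    for x, y in order:
        w -= lr * 2 * (w * x - y) * x   # gradient of (w*x - y)^2 for one sample
    return w

w_batch = 0.0
w_sgd = 0.0
rng = random.Random(0)
for _ in range(50):
    w_batch = batch_step(w_batch, data, lr=0.1)
    w_sgd = sgd_epoch(w_sgd, data, lr=0.1, rng=rng)

print(w_batch, w_sgd)  # both end up close to 3.0
```

Note that one "iteration" means different things here: one averaged update in `batch_step`, four separate updates in `sgd_epoch`. That is one reason raw iteration counts are hard to compare across trainers.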

How do you know to choose 6 hidden layers with 6 neurons each though? Why not 'x' hidden layers with 'j' neurons each? or some other random number?

Also, how do you know to choose a ReLU instead of a tanh activation?

ReLU gives good results for deep learning: http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.p....

6 layers is the maximum that this demonstration allows, and they kept j small-ish to show that you don't need that many to have good results.

What I found interesting is that I couldn't get a proper fit with the same parameters you showed... however, I could 'speed up' the learning by regenerating the data during the learning process.

It may just be that 'batched cumulative learning' (I don't know if there is already a term for this) gets a better fit than just learning from a smaller set of data.

Edit: Did a quick test; regenerating about every 50 and 100 iterations, and convergence does seem faster (at least, when a clear spiral is formed). https://imgur.com/a/OPjXb

Regenerating the data is kind of cheating; it is as if you were given twice the amount of data.

In a normal situation, you obtain a list of input / output pairs (say, images as input, a digit as output, for learning handwritten digits). You split it into training data (which actually improves the net) and testing data (to detect overfitting), and you don't get more data than that.

Here, you can generate more data for free, as we have the function we want to approximate. Having more data will often give a better fit and faster convergence.
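A small sketch of the two situations, in plain Python. The names `split_train_test` and `sample_known_function` are hypothetical, and y = sin(x) is only a stand-in for "a function we know exactly"; it is not the demo's data generator.

```python
import math
import random

def split_train_test(samples, test_fraction=0.2, seed=0):
    """Normal situation: one fixed dataset, shuffled once and split in two."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

def sample_known_function(n, seed):
    """Demo-like situation: draw n fresh (x, y) pairs whenever we want."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    return [(x, math.sin(x)) for x in xs]

# Fixed data: the net only ever sees these 80 training samples.
fixed = sample_known_function(100, seed=0)
train, test = split_train_test(fixed)
print(len(train), len(test))  # 80 20

# Regenerated data: every call with a new seed yields unseen samples.
extra = sample_known_function(100, seed=1)
print(extra[0] != fixed[0])
```

Regenerating corresponds to calling `sample_known_function` again with a new seed during training, which is only possible because the target function is known.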

This is a very good explanation, thanks (even though I knew some of it already)

I tried the swiss roll with a shallow network on the demo (and the results are not excellent, but it matches)

I can reproduce your deep example just fine, but the shallow result needs some luck. At the same time, the shallow result runs faster.
Along with the images that is a very awesome explanation.