by CGamesPlay 3725 days ago
Neat stuff, fun to play with. I wasn't able to get a net to classify the swiss roll. Last time I was playing around with this stuff, I found that the single biggest factor in success was the optimizer used. Is this just using simple gradient descent? I would like to see a drop-down for different optimizers.
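
To make the optimizer question concrete, here is a minimal sketch comparing plain gradient descent with a momentum variant. It runs on a made-up, badly scaled quadratic, not on the playground or the swiss roll, and the function names, the momentum coefficient 0.9, and the step count are all assumptions chosen for illustration. Nothing here says which optimizer the playground actually uses.

```python
def grad(w):
    # Gradient of the toy loss 0.5 * (x**2 + 25 * y**2); hypothetical test problem.
    x, y = w
    return (x, 25.0 * y)

def loss(w):
    x, y = w
    return 0.5 * (x * x + 25.0 * y * y)

def plain_gd(w, lr=0.03, steps=200):
    # Plain gradient descent: step straight down the current gradient.
    for _ in range(steps):
        g = grad(w)
        w = tuple(wi - lr * gi for wi, gi in zip(w, g))
    return w

def momentum_gd(w, lr=0.03, beta=0.9, steps=200):
    # Heavy-ball momentum: keep a running velocity and step along it.
    v = (0.0,) * len(w)
    for _ in range(steps):
        g = grad(w)
        v = tuple(beta * vi - lr * gi for vi, gi in zip(v, g))
        w = tuple(wi + vi for wi, vi in zip(w, v))
    return w

start = (1.0, 1.0)
plain_loss = loss(plain_gd(start))
momentum_loss = loss(momentum_gd(start))
print(plain_loss, momentum_loss)
```

On this toy problem the momentum run ends at a lower loss for the same number of steps, because the shallow direction is no longer limited by the small learning rate. That does not tell you how either would behave on the spiral.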

http://imgur.com/ypBQEWx

Add some noise, use all the inputs, and use one 8-wide hidden layer.

edit: works better with a sigmoid activation curve, but it converges more slowly

Yeah, you're on the right track. A nice pattern emerges after 160 iterations.

http://playground.tensorflow.org/#activation=tanh&batchSize=...

Using sin, cos, x1, x2 with one six-neuron hidden layer does the trick quickly: http://imgur.com/UMv5gsH

No need to mess with noise or regularization :)
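
For anyone wondering what that setup looks like outside the UI, here is a rough sketch of the idea: feed the net sinusoidal transforms of the raw coordinates alongside x1 and x2, then pass them through one hidden layer of six tanh units. Exactly which transforms were selected isn't stated above, so the six-feature list below is an assumption, as are the helper names and the random, untrained weights.

```python
import math
import random

def expand_features(x1, x2):
    # Assumed feature set: raw inputs plus sin/cos of each coordinate.
    return [x1, x2, math.sin(x1), math.cos(x1), math.sin(x2), math.cos(x2)]

def init_layer(n_in, n_out, rng):
    # One row per neuron; the last entry of each row is the bias.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward_layer(weights, inputs):
    # tanh(w . x + b) for every neuron in the layer.
    return [
        math.tanh(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in weights
    ]

rng = random.Random(0)
hidden = init_layer(6, 6, rng)   # six features -> six hidden neurons
output = init_layer(6, 1, rng)   # six hidden neurons -> one output

features = expand_features(0.5, -1.0)
prediction = forward_layer(output, forward_layer(hidden, features))[0]
print(prediction)  # untrained, so the value itself is meaningless
```

The point is only that the periodic features give a single small layer something spiral-shaped to work with, instead of making extra depth learn it from x1 and x2 alone.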

> Add some noise

This actually makes the dataset harder to fit. It is not the same thing as the "training with noise" method, where fresh random noise is added to each batch and which acts as an alternative form of Tikhonov regularization.
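
A small sketch of the distinction, with hypothetical helper names and a made-up four-point dataset: in the first function the points are perturbed once and the same noisy set is reused for every epoch, while in the second each mini-batch gets freshly drawn noise every time it is served.

```python
import random

def add_dataset_noise(points, sigma, seed=0):
    # Perturb every point once; the result is then fixed for all epochs.
    rng = random.Random(seed)
    return [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in points]

def noisy_batches(points, batch_size, sigma, rng):
    # Serve mini-batches, drawing new noise each time a batch is produced.
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        yield [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma)) for x, y in batch]

clean = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Dataset noise: one fixed, noisy copy of the data.
fixed = add_dataset_noise(clean, sigma=0.1)

# Training with noise: two passes over the same clean data see different inputs.
rng = random.Random(1)
epoch_a = [p for batch in noisy_batches(clean, 2, 0.1, rng) for p in batch]
epoch_b = [p for batch in noisy_batches(clean, 2, 0.1, rng) for p in batch]
```

In the second case the network never sees the same perturbed point twice, which is what gives the smoothing effect; in the first it can simply memorize the noisy points.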

With that particular data set, it looks like it really just adds more data and, more importantly, fills in the gaps along the spirals, which is where my setup was having trouble.

The noise doesn't go far enough to start confusing points between different clusters, but it adds more points.

That said, my knowledge of neural nets is fairly limited.

Using all inputs and 6 layers of varying sizes. After about 500 iterations. http://i.imgur.com/x1MOpvl.jpg

Just 100 iterations, learning rate 0.03, activation tanh, regularization L2, rate 0.01. The network is 8,8,8 neurons per layer.

Using the defaults, I had success at about 300 iterations with all the inputs and 5 hidden layers, each with a decreasing number of neurons (i.e. 6,5,4,3,2).

I don't know whether needing fewer neurons in each successive layer is a general rule, but it seems to work here.
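
As a side note on the learning rate 0.03 and L2 rate 0.01 quoted for the 8,8,8 network above, here is a minimal sketch of what those two numbers do in a single weight update. It assumes the textbook formulation where the penalty is (rate / 2) * sum(w^2), so its gradient is rate * w; whether the playground applies it in exactly this form is not something this thread confirms. The function name is hypothetical.

```python
def sgd_l2_step(weights, grads, lr=0.03, l2=0.01):
    # One gradient step on (data loss + (l2 / 2) * sum of squared weights).
    return [w - lr * (g + l2 * w) for w, g in zip(weights, grads)]

# With a zero data gradient, the L2 term alone slowly shrinks the weight.
w = [1.0]
for _ in range(100):
    w = sgd_l2_step(w, [0.0])
print(w[0])
```

Each step multiplies an otherwise idle weight by (1 - lr * l2), so at these settings the pull toward zero is gentle compared with the data gradient.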

Which optimization algorithms did you have the most success with? Were they more successful in the sense of a better out-of-sample error rate, quicker convergence, or something else?