| HN Mirror


by nullc 4448 days ago

Another limit they don't address is that the training normally used is purely local: just gradient descent. So even when the network can model your function well, there is no guarantee that it will find the solution.

For me ANN's always seem to get stuck on not very helpful local minima, so they're far from the first tool in my bag of tricks.

I often think of them as the sort of thing that someone who doesn't really know what they're talking about talks about. (Esp. if it's clear that in their minds NN have magical powers. :) maybe they'll also mention something about "genetic algorithms")

2 comments

alkonaut 4448 days ago

> So even when the network can model your function well, there is no guarantee that it will find the solution.

If it models the function over the input domain, then it is properly trained. If it is trained to a local minimum, then it doesn't model the underlying function well over the whole input domain. If you have good, representative training and validation sets, you will be able to tell.

> Esp. if it's clear that in their minds NN have magical powers

I know that type. When dealing with ANNs you quickly realize (just like in all data science) that all of the "magic" relies on the manual work and thought that goes into cleaning and adapting the data. Not very sexy work, and work that requires a fair bit of knowledge about the problem domain.

> For me ANN's always seem to get stuck on not very helpful local minima

It isn't the ANN that gets stuck, it's the training algorithm (gradient descent) that gets stuck :) Training is orthogonal to the operation of the network itself (which is just a nonlinear function in the end!). Gradient descent via error backpropagation is the most common training method for MLPs, but you could imagine a random/brute-force search that is significantly simpler to implement, but slower. Since a network is often trained once and then used repeatedly, it is often plausible to train it for several weeks if needed! A pure random search is usually not feasible, but adding randomization to gradient descent will help. There are many ways to avoid local minima in gradient descent, if you have time to wait.
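
One simple form of "adding randomization" is random restarts: run plain gradient descent from several random starting points and keep the best result. The sketch below is only an illustration under stated assumptions. The 1-D objective `loss` is a made-up non-convex function standing in for a network's training error, and the names `gradient_descent` and `random_restarts` are hypothetical helpers, not anything from the thread.

```python
import random

def loss(w):
    # Hypothetical non-convex objective with two minima (not a real network loss).
    return w**4 - 3.0 * w**2 + w

def grad(w):
    # Analytic derivative of loss(w).
    return 4.0 * w**3 - 6.0 * w + 1.0

def gradient_descent(w, lr=0.01, steps=2000):
    # Purely local search: follows the slope downhill from the starting point.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def random_restarts(n_starts=20, seed=0):
    # Run gradient descent from several random starts and keep the lowest loss.
    rng = random.Random(seed)
    best_w = None
    for _ in range(n_starts):
        w = gradient_descent(rng.uniform(-2.0, 2.0))
        if best_w is None or loss(w) < loss(best_w):
            best_w = w
    return best_w

stuck = gradient_descent(2.0)   # a single run from one fixed start
best = random_restarts()        # best of several randomized runs
print("single run:", stuck, loss(stuck))
print("restarts:  ", best, loss(best))
```

With these assumptions, the single run started at w = 2.0 settles in the shallower minimum near w ≈ 1.13, while the restarts find the deeper one near w ≈ -1.30. The price is exactly the one mentioned above: many more training runs.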

> maybe they'll also mention something about "genetic algorithms"

The simple error backpropagation methods only work well for normal feed-forward networks. Other topologies, e.g. recurrent networks, require more exotic methods. In my (limited) experience, genetic algorithms are rarely efficient as a training method though.


nanidin 4448 days ago

Well, you could use an EA to take a stab at finding better minima :)

And correct me if I'm wrong, but isn't the cost function for a feed-forward neural network that uses a sigmoid activation function convex wrt the parameters being trained, i.e. gradient descent is guaranteed to find the global minimum when a small enough step size is used?


chestervonwinch 4447 days ago

Mostly, no. Hidden units introduce non-convexity to the cost. How about a simple counter-example?

Take a simple classifier network with one input, one hidden unit and one output and no biases. To make things even simpler, tie the two weights, i.e. make the first weight equal to the second. Now, mathematically the output of the network can be written: z=f(w * f(w * x)) where f() is the sigmoid.

Next, consider a dataset with two items: [(x_1, y_1), (x_2, y_2)] where x_i is the input and y_i is the class label, 0 or 1. Take as values: [(0.9, 1), (0.1, 0)]. The cost function (log-likelihood in this case) is:

L(w) = sum_i { y_i * log( f(w * f(w * x_i)) ) + (1-y_i) * log( 1-f(w * f(w * x_i)) ) }

L(w) = log( f(w * f(w * 0.9)) ) + log( 1-f(w * f(w * 0.1)) )

Plot that last expression with f replaced by the sigmoid, and you'll see the result is non-convex: there's a kink near zero.
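
For anyone who would rather check numerically than plot, here is a minimal sketch of the same counter-example. Since L(w) is a log-likelihood (something to maximize), a convex cost would correspond to a concave L, so the check looks for places where L curves upward. The helper names (`log_likelihood`, `second_difference`), the grid range, and the step size `h` are assumptions made for illustration.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def log_likelihood(w, data=((0.9, 1), (0.1, 0))):
    # L(w) for the tied-weight network z = f(w * f(w * x)) on the two-item dataset.
    total = 0.0
    for x, y in data:
        z = sigmoid(w * sigmoid(w * x))
        total += y * math.log(z) + (1 - y) * math.log(1.0 - z)
    return total

def second_difference(f, w, h=1e-3):
    # Central finite-difference estimate of the second derivative of f at w.
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)

# Sample the curvature of L on a grid from -5 to 5 (range chosen arbitrarily).
grid = [i / 10.0 for i in range(-50, 51)]
curvatures = [second_difference(log_likelihood, w) for w in grid]
has_positive = any(c > 0 for c in curvatures)
has_negative = any(c < 0 for c in curvatures)

print("L(-1), L(0), L(1):", log_likelihood(-1.0), log_likelihood(0.0), log_likelihood(1.0))
print("curvature at w = 0:", second_difference(log_likelihood, 0.0))
print("curvature changes sign on the grid:", has_positive and has_negative)
```

The curvature takes both signs on the grid, and L(0) lies below the average of L(-1) and L(1), which a concave function cannot do. So the negative log-likelihood is not convex in w even for this tiny network.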
