by HarHarVeryFunny
825 days ago
Calculus is all you need! Neural nets are trained to minimize their error (the gap between what they actually output and what we want them to output). When we build a neural net we know the function that defines the output error, so training (finding a minimum of the error function) is done just by following the gradient (derivative) of the error function downhill.
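To make "following the gradient" concrete, here is a minimal sketch. The one-weight "network" y = w * x, the single training example (x = 2, target 6), and the names `error`, `error_grad`, `gradient_descent` and `lr` are all made up for illustration; a real net has many weights and gets its gradient from backpropagation rather than a hand-written derivative.

```python
def error(w):
    # Squared error of the toy model y = w * x on one example (x = 2, target = 6).
    return (w * 2.0 - 6.0) ** 2


def error_grad(w):
    # Derivative of (2w - 6)^2 with respect to w, via the chain rule.
    return 4.0 * (2.0 * w - 6.0)


def gradient_descent(w, lr=0.05, steps=100):
    # Repeatedly step against the gradient, i.e. downhill on the error.
    for _ in range(steps):
        w = w - lr * error_grad(w)
    return w


w_final = gradient_descent(0.0)
print(w_final, error(w_final))
```

The error here is a simple bowl, so the weight settles at the value that makes the output match the target (w = 3). The step size `lr` is an arbitrary choice that happens to be small enough for this function.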
|
It is clear enough that following the gradient of a bounded differentiable function can bring you to a local minimum of the function (unless, I guess, there's a path that heads away from the starting location, going off to infinity, along which the function is always decreasing and asymptotically approaching some value; but this sort of situation can be prevented by adding loss terms that penalize parameters being too big).
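A toy sketch of that escape-to-infinity case, under made-up assumptions: the loss f(w) = 1 / (1 + e^w) is bounded and always decreasing, so it has no minimum at any finite w. The penalty weight `lam` and the helper names below are hypothetical, and the penalty is a plain L2 term lam * w^2.

```python
import math


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


def grad_plain(w):
    # Derivative of f(w) = 1 / (1 + e^w), which equals -sigmoid(w) * (1 - sigmoid(w)).
    s = sigmoid(w)
    return -s * (1.0 - s)


def grad_penalized(w, lam=0.01):
    # Same loss plus an L2 penalty lam * w^2 on the parameter.
    return grad_plain(w) + 2.0 * lam * w


def descend(grad, w=0.0, lr=0.1, steps=5000):
    for _ in range(steps):
        w -= lr * grad(w)
    return w


w_short = descend(grad_plain, steps=5000)
w_long = descend(grad_plain, steps=50000)
w_pen = descend(grad_penalized)
print(w_short, w_long, w_pen)
```

Without the penalty, w just keeps creeping upward the longer you run it. With the penalty, the iterate stops at a finite point where the two gradient terms cancel.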
But what determines whether it reaches a global minimum? Or, if it doesn't reach a global minimum, what kinds of local minima are there, and what determines which kinds it is more likely to end up in? Does adding momentum and stochasticity to the gradient descent influence the kinds of local minima that are likely to be approached? If so, in what way?
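A small sketch of one piece of this, the dependence on the starting point, using a made-up one-dimensional loss f(w) = w^4 - 3w^2 + w that has two basins. The function, the names `loss`, `grad` and `descend`, and the `beta` momentum parameter (heavy-ball style, with beta = 0 meaning plain gradient descent) are all assumptions for illustration. It says nothing about the high-dimensional loss surfaces of real networks.

```python
def loss(w):
    # Non-convex toy loss with a deeper minimum near w = -1.30
    # and a shallower one near w = 1.13.
    return w ** 4 - 3.0 * w ** 2 + w


def grad(w):
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def descend(w, lr=0.01, beta=0.0, steps=2000):
    # Heavy-ball update; with beta = 0 this is plain gradient descent.
    v = 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w


w_from_right = descend(2.0)   # plain gradient descent starting at w = 2
w_from_left = descend(-2.0)   # plain gradient descent starting at w = -2
print(w_from_right, loss(w_from_right))
print(w_from_left, loss(w_from_left))
```

With plain gradient descent and this small step size, each run just slides into the basin it started in, so only the run started on the left finds the deeper minimum. Passing something like `beta=0.9` lets you experiment with whether momentum carries the iterate over the barrier between the basins; that depends on the step size and the start, so no outcome is claimed here.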