by drdeca
818 days ago
I think there are still open questions about this that are worth asking. It is clear enough that following the gradient of a bounded differentiable function can bring you to a local minimum of the function (unless, I guess, there's a path heading away from the starting location and off to infinity, along which the function is always decreasing while asymptotically approaching some value; but this sort of situation can be prevented by adding loss terms that penalize parameters for being too big). But what determines whether it reaches a global minimum? Or, if it doesn't reach a global minimum, what kinds of local minima are there, and what determines which kinds it is more likely to end up in? Does including momentum and stochasticity in the gradient descent influence the kinds of local minima that are likely to be approached? If so, in what way?
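To make the "path off to infinity" case concrete, here is a minimal sketch in Python. The function f(x) = 1/(1 + e^x), the penalty weight `lam`, the learning rate, and the helper names `grad` and `descend` are all my own illustrative assumptions, not anything from a particular library. The function is bounded and always decreasing, so plain gradient descent never settles; adding a `lam * x**2` penalty gives it a genuine minimum to converge to.

```python
import math

def grad(x, lam):
    """Gradient of f(x) + lam * x**2, where f(x) = 1 / (1 + exp(x)).

    f is bounded in (0, 1) and strictly decreasing, so on its own
    it has no minimum: it only approaches 0 as x goes to infinity.
    """
    s = 1.0 / (1.0 + math.exp(-x))        # logistic sigmoid; f(x) = 1 - s
    return -s * (1.0 - s) + 2.0 * lam * x

def descend(x0, lam, lr=0.5, steps=1000):
    """Plain gradient descent from x0 (hypothetical helper for illustration)."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x, lam)
    return x

# Without a penalty, x keeps drifting to the right the longer we run.
drift_short = descend(0.0, lam=0.0, steps=1000)
drift_long = descend(0.0, lam=0.0, steps=2000)

# With a small penalty, the iterate settles at a stationary point.
settled = descend(0.0, lam=0.05, steps=2000)
```

In this toy case the unpenalized run only ever moves further out (each step is strictly positive), while the penalized run ends up where the gradient is essentially zero.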
Different learning rate schedules, as well as momentum etc., can also help avoid getting stuck for too long in areas of the loss landscape that may not be local minima but may still be slow to move out of. One more modern approach is to cycle between higher and lower learning rates rather than just use monotonically decreasing ones.
I'm not sure what the latest research is, but things like batch size and learning rate can certainly affect the minimum found, with some resulting in better generalization than others.
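As a rough illustration of what cycling the learning rate can look like, here is a sketch of a triangular schedule in Python. The function name `triangular_lr` and the default values for `base_lr`, `max_lr`, and `half_cycle` are assumptions picked for the example, not recommended settings or any framework's API.

```python
def triangular_lr(step, base_lr=0.001, max_lr=0.01, half_cycle=100):
    """Triangular cyclical learning rate (illustrative sketch).

    The rate climbs linearly from base_lr to max_lr over half_cycle
    steps, falls back to base_lr over the next half_cycle steps,
    and then repeats.
    """
    pos = step % (2 * half_cycle)          # position within the current cycle
    if pos <= half_cycle:
        frac = pos / half_cycle            # rising half: 0 -> 1
    else:
        frac = 2.0 - pos / half_cycle      # falling half: 1 -> 0
    return base_lr + (max_lr - base_lr) * frac

# One learning rate per training step, covering two full cycles.
schedule = [triangular_lr(s) for s in range(400)]
```

Each value in `schedule` would be used as the step size for the corresponding update, in place of a fixed or monotonically decaying rate.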