| HN Mirror

Local minima aren't normally a problem for neural nets since they usually have a very large number of parameter, meaning that the loss/error landscape has a correspondingly high number of dimensions. You might be in a local minima in one of those dimensions, but the probability of simultaneously being in a local minima of all of them is vanishingly small.

Different learning rate schedules, as well as momentum/etc, can also help getting stuck for too long in areas of the loss landscape that many not be local minima, but may still be slow to move out of. One more modern approach is to cycle between higher and lower learning rates rather than just use monotonically decreasing ones.

I'm not sure what latest research is, but things like batch size and learning rate can certainly effect the minimum found, with some resulting in better generalization than others.