| The statement "deep neural networks are not affected by poor local minima" is not really a personal opinion or theory at this point; it's the dominant consensus in the research community. These are not just theoretical results. They're theory papers trying to explain an empirical observation: neural nets don't get stuck in bad local minima. To quote: "Given that deep networks are highly nonlinear systems optimized by local gradient methods, why do they not seem to be affected by bad local minima?" There are other such results as well. As I said above, neural nets can obviously get stuck in local minima in toy examples. If you read my comment above, you'll see that this has no bearing on my initial statement. Dropout's main motivation is not to break out of local minima. It's to achieve better generalization. If it were meant to break out of bad minima, adding dropout would give us a better training loss, which is obviously not the case. As for SGD, we used to think it was mainly a computational necessity. That is, we can't fit the entire training set into a single batch, so we have to split it into mini-batches. More recent theory suggests that SGD also helps avoid sharp minima, among other desirable properties. I'm not sure you're really reading my comments thoroughly or checking out the links, so if you're actually interested in understanding what's going on, please do some proper research on the topic. |