|
I can't comment on #4 but with regards to 1-2 we can make an argument that high dimensional space is generally friendly towards stochastic gradient descent. Consider a neural net that produces a single output given `N` inputs– it's basically a function `f(x) = f(x_1, x_2, …, x_N)`
A local minimum x* has the gradient `∇f(x) = (0, 0, …, 0)` and an NxN Hessian of the form `[H•f(x)]_{ij} = (∂^2 f)/(∂x_i ∂x_j)`. The critical point is a local maximum if the Hessian is positive definite at x and a local minimum if it's negative definite; it's a saddle point otherwise.
This corresponds to the Hessian having a mixture of positive and negative eigenvalues. Heuristically, we have N eigenvalues, and absent prior information we can expect that the probability of each eigenvalue being positive is 1/2, and 1/2 for the eigenvalue being negative instead.
So the probability of an arbitrary critical point being a local minimum is `(1/2)^N`, which becomes extremely small as N grows large.
Large values of N are kinda the bread-and-butter of deep learning, so we expect that most critical points we'll encounter are saddle points.
Additionally, the inherent noise in SGD means that we're unlikely to stay trapped at a saddle point, because once you're nudged away from the saddle, the rate at which you slide off it increases rapidly. So if your net appears to have converged, it's probably at a local optimum with a reasonably deep basin of attraction, assuming you're using the bag of tricks we've accumulated in the last decade (stuff like adding a bit of noise and randomizing the order in which the training data is presented). As for 3 & 5, we kinda cheat because if your model appears to have converged but is not performing adequately, we step outside of the learning algorithm and modify the structure of the neural net or tune the hyperparameters.
I don't know if we'll ever have a general theory that explains why neural nets seem to work so well, because you'd have to characterize both the possible tasks and the possible models, but perhaps we'll gradually chip away at the problem as we gather more empirical data and formalize heuristics based on those findings. |