by nilkn
2777 days ago
I don't see how you really addressed the concerns. You say that the paper is "trying to argue a bit about why neural networks generalize well," but in fact I don't see anything in this paper about generalization or test error. The first item of future research listed in section seven is to look at test error instead of training error: "The current paper focuses on the train loss, but does not address the test loss. It would be an important problem to show that gradient descent can also find solutions of low test loss. In particular, existing work only demonstrate that gradient descent works under the same situations as kernel methods and random feature methods [Daniely, 2017, Li and Liang, 2018]."
|
The typical way of showing generalization in ML is to show that if we have some low- or zero-error solution on the training data set, then for a large enough data set, with high probability, the error on the training data is close to the error on the real and unknown distribution. The first step, which is basically "find a low-error hypothesis on the training data," is called the ERM (empirical risk minimization) principle.
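For anyone who wants that in symbols, here is a rough sketch in standard learning-theory notation. The symbols are my own choice, not the paper's: S is the training sample of n i.i.d. draws from the unknown distribution D, \ell is the loss, and \mathcal{H} is the hypothesis class.

```latex
% Training (empirical) error and true error of a hypothesis h
L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big),
\qquad
L_D(h) = \mathbb{E}_{(x,y) \sim D}\, \ell\big(h(x), y\big)

% ERM: pick a hypothesis with the lowest training error
\mathrm{ERM}_{\mathcal{H}}(S) \in \arg\min_{h \in \mathcal{H}} L_S(h)

% Generalization (uniform convergence form): for n large enough,
% depending on \epsilon, \delta and \mathcal{H},
\Pr_{S \sim D^n}\Big[ \sup_{h \in \mathcal{H}} \big| L_S(h) - L_D(h) \big| \le \epsilon \Big] \ge 1 - \delta
```

When the last statement holds, a low L_S for the ERM output implies a low L_D, which is the two-step argument described above.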
In practice we observe that stochastic gradient descent works pretty well at solving the ERM problem, and that the solutions generalize well (perform well when deployed).
This is very weird, since neural networks are strange objects with very non-linear and non-convex behavior, and gradient descent shouldn't play well with bumps and curves and valleys.
People want to show mathematically that stochastic gradient descent does well on neural networks.
This paper claims gradient descent is effective at minimizing quadratic loss on the training data.
If the results could be extended to show that we also get low loss on the true distribution, that would be compelling evidence that gradient descent converges to a minimum-error solution.
None of this is explicitly stated, since it is a well-understood part of the basic learning theory literature.
Showing that an algorithm can do ERM on the hypothesis class is the first (and easier) part of showing generalization.
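To make the train-loss versus test-loss distinction concrete, here is a minimal NumPy sketch. It is a toy setup of my own, not the paper's construction or experiments: a wide two-layer ReLU network with fixed random output signs, trained by full-batch gradient descent on quadratic loss. The teacher vector `W_STAR`, the sizes, the learning rate, and all function names are assumptions picked for illustration.

```python
import numpy as np

# Hypothetical linear "teacher", used only to generate labels for this toy example.
W_STAR = np.array([0.2, 0.4, 0.6, 0.8, 1.0])


def make_data(n, rng):
    """Draw n unit-norm inputs and label them with the toy teacher."""
    X = rng.normal(size=(n, W_STAR.size))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X, X @ W_STAR


def predict(W, a, X):
    """Two-layer ReLU net: f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])


def quadratic_loss(W, a, X, y):
    """Average quadratic loss: 1/(2n) * sum_i (f(x_i) - y_i)^2."""
    return 0.5 * np.mean((predict(W, a, X) - y) ** 2)


def gradient_step(W, a, X, y, lr):
    """One full-batch gradient descent step on the first-layer weights only."""
    n, m = X.shape[0], W.shape[0]
    pre = X @ W.T                                      # (n, m) pre-activations
    resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y  # (n,) prediction errors
    # dL/dw_r = (1/n) * sum_i resid_i * (a_r / sqrt(m)) * 1[pre_ir > 0] * x_i
    grad = (resid[:, None] * (pre > 0) * a[None, :]).T @ X / (n * np.sqrt(m))
    return W - lr * grad


def run_experiment(n_train=20, n_test=1000, m=1000, lr=2.0, steps=3000, seed=0):
    """Do ERM by gradient descent, then measure loss on fresh samples."""
    rng = np.random.default_rng(seed)
    X_tr, y_tr = make_data(n_train, rng)   # training sample S
    X_te, y_te = make_data(n_test, rng)    # fresh draws, a proxy for D
    W = rng.normal(size=(m, W_STAR.size))  # random first-layer initialization
    a = rng.choice([-1.0, 1.0], size=m)    # fixed random output signs
    result = {
        "train_init": quadratic_loss(W, a, X_tr, y_tr),
        "test_init": quadratic_loss(W, a, X_te, y_te),
    }
    for _ in range(steps):
        W = gradient_step(W, a, X_tr, y_tr, lr)
    result["train_final"] = quadratic_loss(W, a, X_tr, y_tr)
    result["test_final"] = quadratic_loss(W, a, X_te, y_te)
    return result


if __name__ == "__main__":
    for name, value in run_experiment().items():
        print(f"{name}: {value:.6f}")
```

The number to watch is the gap between `train_final` and `test_final`. Driving the training loss down is the ERM step; nothing in the optimization itself says anything about the loss on fresh samples, which is why the two have to be argued separately.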
If you want a good reference that explains this more coherently, I recommend the first four chapters of Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David.
If you still think the comments I was responding to are not totally incoherent, take note of the fact that the very first sentence in the paper is "One of the mysteries in deep learning is random initialized first order methods like gradient descent achieve zero training loss".