by skeptic_69 2776 days ago
1. People fit the baby datasets (MNIST) to zero training loss all the time. Maybe you meant a "hard" dataset.

2. You clearly have no idea what you are talking about. This paper is trying to argue a bit about why neural networks generalize well by showing, with math, that a neural network satisfying certain conditions converges to zero training loss. It isn't remotely meant to be practical. IT IS A THEORETICAL PAPER.

And comparing it to 1-nearest-neighbor is so so so so so silly it isn't even wrong.

Edit: #1 is actually an entire research direction in the theory of machine learning, FYI.

It is possible to get neural networks that massively overfit but still generalize (which is weird).

https://arxiv.org/pdf/1611.03530.pdf

That paper was really famous. It showed you can get zero training loss on data when you replace the labels with random noise.
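A rough way to see the "fit pure noise" effect without training a deep net: the sketch below is a linear stand-in, not the linked paper's experiment (which used deep networks on image datasets). It fits random ±1 labels exactly with an overparameterized least-squares model. The function name, sizes, and seed are all assumptions chosen for illustration.

```python
import numpy as np

def fit_random_labels(n=50, d=200, seed=0):
    """Fit pure-noise labels with more parameters (d) than samples (n)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))          # random inputs
    y = rng.choice([-1.0, 1.0], size=n)      # labels carry no signal at all

    # With d > n the system is underdetermined; lstsq returns the
    # minimum-norm solution, which interpolates the training labels.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    train_mse = float(np.mean((X @ w - y) ** 2))
    train_acc = float(np.mean(np.sign(X @ w) == y))

    # Fresh data with fresh random labels: nothing can be learned here,
    # so accuracy should hover around chance.
    X_new = rng.standard_normal((1000, d))
    y_new = rng.choice([-1.0, 1.0], size=1000)
    new_acc = float(np.mean(np.sign(X_new @ w) == y_new))
    return train_mse, train_acc, new_acc

if __name__ == "__main__":
    print(fit_random_labels())
```

Zero training loss here says nothing about generalization, since the labels are noise by construction. The point is only that enough parameters make memorization easy.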

edit 2: I am sorry to be harsh. It is just hard to read such arrant nonsense.

2 comments

I don't see how you really addressed the concerns. You say that the paper is "trying to argue a bit about why neural networks generalize well," but in fact I don't see anything in this paper about generalization or test error. The first line of future research under section seven is to look into test error instead of training error:

"The current paper focuses on the train loss, but does not address the test loss. It would be an important problem to show that gradient descent can also find solutions of low test loss. In particular, existing work only demonstrate that gradient descent works under the same situations as kernel methods and random feature methods [Daniely, 2017, Li and Liang, 2018]."

Have you heard of something called ERM (empirical risk minimization)? Uniform convergence?

The typical way of showing generalization in ML is to show that if we have some low- or zero-error solution on the training dataset, then for a large enough dataset, with high probability, the error on the training dataset is close to the error on the real and unknown distribution. The first step, which is basically "find a low-error hypothesis on the training data," is called the ERM principle.

In practice we observe that stochastic gradient descent works pretty well at solving the ERM problem, and the solutions generalize well (perform well when deployed).
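Written out, the two ideas look roughly like this. The notation is assumed rather than taken from the paper: S is an i.i.d. training sample of size m, H is the hypothesis class, and the loss is bounded in [0, 1]. The bound shown is the textbook finite-class case (Hoeffding plus a union bound), which is the simplest instance and not a statement about neural networks.

```latex
% Empirical (training) risk and true risk
L_S(h) = \frac{1}{m}\sum_{i=1}^{m} \ell\bigl(h(x_i), y_i\bigr),
\qquad
L_{\mathcal{D}}(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\bigl(h(x), y\bigr)

% ERM: pick a hypothesis with the lowest training risk
\hat{h} \in \arg\min_{h \in \mathcal{H}} L_S(h)

% Uniform convergence for a finite class \mathcal{H}
\Pr\left[\,\forall h \in \mathcal{H}:\;
  \bigl|L_{\mathcal{D}}(h) - L_S(h)\bigr|
  \le \sqrt{\frac{\ln\bigl(2|\mathcal{H}|/\delta\bigr)}{2m}}\,\right]
\ge 1 - \delta
```

If the last line holds, a hypothesis with low training risk also has low true risk, which is why solving ERM is the first half of the argument.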

This is very weird since neural networks are really weird objects with very non-linear and non-convex behavior and gradient descent shouldn't play well with weird bumps and curves and valleys.

People want to show mathematically that stochastic gradient descent does well on neural networks.

This paper claims gradient descent is effective at minimizing quadratic loss on the training data.
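To make "gradient descent minimizing quadratic loss on the training data" concrete, here is a toy sketch under stated assumptions: a wide two-layer ReLU network with fixed ±1 output weights, unit-norm random inputs, and full-batch gradient descent on the first layer only. The function name, sizes, step size, and seed are all illustrative choices, and this is not claimed to be the paper's exact construction or to demonstrate its theorems.

```python
import numpy as np

def train_two_layer_relu(n=10, d=20, m=2000, lr=0.2, steps=1000, seed=0):
    """Full-batch GD on 0.5 * sum_i (f(x_i) - y_i)^2 for a wide ReLU net."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = rng.standard_normal(n)                     # arbitrary real targets

    W = rng.standard_normal((m, d))                # first layer, trained
    a = rng.choice([-1.0, 1.0], size=m)            # output signs, kept fixed

    losses = []
    for _ in range(steps):
        pre = X @ W.T                                 # (n, m) pre-activations
        out = np.maximum(pre, 0.0) @ a / np.sqrt(m)   # (n,) network outputs
        resid = out - y
        losses.append(0.5 * float(np.sum(resid ** 2)))

        # Gradient for hidden unit r:
        # (1/sqrt(m)) * a[r] * sum_i resid[i] * 1[pre[i, r] > 0] * X[i]
        active = (pre > 0).astype(float)              # (n, m) ReLU gates
        grad = ((active * resid[:, None]).T @ X) * a[:, None] / np.sqrt(m)
        W -= lr * grad
    return np.array(losses)

if __name__ == "__main__":
    losses = train_two_layer_relu()
    print(losses[0], losses[-1])
```

With many more hidden units than training points, the training loss should fall by orders of magnitude in this setup. Note that it only approaches zero; it is never exactly zero after finitely many steps, and nothing here measures loss on unseen data.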

If we could improve the results to show that we also have low loss on the true distribution, that might be compelling evidence that gradient descent converges to the minimum-error solution.

None of this is explicitly stated, since this is a well-understood part of the basic literature in learning theory.

Showing that an algorithm can do ERM on the hypothesis class is the first (and easier) part of showing generalization.

If you want a good reference that explains this in a more coherent way, I recommend looking at the first 4 chapters of Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David.

If you still think the comments I was responding to are not totally incoherent, take note of the fact that the very first sentence in the paper is "One of the mysteries in deep learning is random initialized first order methods like gradient descent achieve zero training loss".

This is an intriguingly aggressive comment.

1. No, it's impossible. Actually, the theorems in this paper do not claim to reach zero loss either, as they're all inequalities on the size of the loss. The paper you cite refers to converging to zero loss, as do you in point 2. Perhaps you're referring to error, which is not the loss that is directly optimized.

2. This paper certainly isn't talking about generalization. It doesn't appear to be mentioned once. Your other paper is talking about generalization. The parent asked if this paper is super important. I gave a reason why it isn't super important for most people.

3. Massively overfitting is antithetical to generalizing. Overfitting means fitting to the extent that you're generalizing less well.

1. Hmm, OK, I am willing to accept you meant the quadratic loss instead of 0-1 error. That seems reasonable.

2. This paper is centered in a research thrust that IS focused on generalization. See my comment below.

I don't know who "most people" are, but this paper COULD be important in understanding why stochastic gradient descent works well in practice.

Personally I doubt it very much.

3. Massively overfitting to the training dataset BUT generalizing well is a real phenomenon, and yes, it is very weird. It happens in deep nets and, I believe, AdaBoost, i.e. when you continue to train after you have zero 0-1 loss. I agree this is a weird way to communicate the idea, but that is the terminology the community uses.