
Well, gradient descent doesn't do that. And the models, while big in terms of parameter count, are nowhere near big enough to actually store all the training data.

Think of it in terms of updating beliefs about the target distribution. With backpropagation, you predict based on the input, and update your beliefs according to how wrong you were. So in a sense it's unsound to re-use data: your beliefs already incorporate it! And traditional overfitting is exactly that: it's what happens when you've used up all the information in your training data. This was many people's objection to neural nets. I thought it was a good objection at the time, and thought that the future lay with more "sound" methods (which performed better on most metrics back then anyway) rather than with dodgy biomimicry that wasn't really even similar to biological brains.
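To make the predict-then-update loop concrete, here is a minimal sketch of single-pass training, where every example is seen exactly once. It uses a toy linear model rather than a neural net, and the names `single_pass_sgd` and `make_stream`, the target `y = 2x + 1`, and the learning rate are all made up for illustration:

```python
import random


def single_pass_sgd(stream, lr=0.1):
    """Fit y ~ w*x + b, seeing each (x, y) pair exactly once."""
    w, b = 0.0, 0.0
    for x, y in stream:
        pred = w * x + b      # predict from the current beliefs
        err = pred - y        # how wrong were we?
        w -= lr * err * x     # gradient step on the loss 0.5 * err**2
        b -= lr * err
    return w, b


def make_stream(n, seed=0):
    """Yield n fresh samples from a hypothetical noise-free target."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0


w, b = single_pass_sgd(make_stream(2000))
print(w, b)
```

Because each prediction is made before the example is used for an update, nothing here is ever scored on data the model has already absorbed, which is the sense in which a single pass sidesteps the old worry.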

But yes, there are other types of overfitting if you want to get philosophical about it. It's just that the one that I and everyone else used to worry about, from training too much on your data, just isn't important anymore. And most of those clever regularization methods, principled and less principled alike, just don't matter anymore!