Hacker News
by halflings 1038 days ago
The interesting part is the sudden generalization.

Simple models predicting simple things will generally overfit slowly, and regularization keeps that overfitting in check.

This "grokking" phenomenon is when a model starts by aggressively overfitting, then gradually prunes unnecessary weights until it suddenly converges on the one generalizable combination of weights (as it's the only one that both fits the training data and minimizes the weight norm).
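
To make the "fits the training data and minimizes weights" part concrete, here is a toy sketch. It is not the setup from the grokking work, just an assumed two-weight linear model with a single training point, so infinitely many weight vectors fit the data perfectly. The function name `train`, the learning rate and the weight-decay value are all made up for illustration. Starting from a large-norm solution that already fits, gradient descent with a small L2 penalty drifts along the set of fitting solutions toward (roughly) the smallest-norm one.

```python
def norm(w):
    """Euclidean norm of a weight vector."""
    return sum(wi * wi for wi in w) ** 0.5


def train(w, x, y, lr=0.1, weight_decay=0.01, steps=5000):
    """Gradient descent on squared error plus an L2 penalty (hypothetical helper)."""
    history = []
    for _ in range(steps):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        # Gradient of err**2 + weight_decay * ||w||**2 with respect to w.
        grad = [2 * err * xi + 2 * weight_decay * wi for wi, xi in zip(w, x)]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        history.append((err * err, norm(w)))
    return w, history


# One training point: any w with w[0] + w[1] == 2 fits it exactly.
x, y = [1.0, 1.0], 2.0
w_start = [3.0, -1.0]  # fits the data, but with a needlessly large norm
w_final, history = train(w_start, x, y)

print("start:", w_start, "norm:", norm(w_start))
print("final:", w_final, "norm:", norm(w_final))
```

The training error stays near zero the whole time while the norm shrinks, and the weights end up close to (1, 1), the minimum-norm solution. The final weights sit slightly below 1 because the penalty trades a tiny amount of training error for a smaller norm.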

Why is this interesting? Because you could argue that this justifies using overparametrized models with high levels of regularization, i.e. models that will tend to aggressively overfit but over time might converge to a better solution by gradual pruning of weights. The traditional approach is not to do this, but rather to use a simpler model (which would initially generalize better but, due to its simplicity, might not be able to learn the underlying mechanism and reach higher accuracy).

1 comment

It's interesting that the researchers chose example problems where the minimum norm solution is the best at generalization. What if that's not the case?
Yeah, this is what’s really going on here, and it feels like it’s been shrouded in language to make it seem more grandiose. That being said, I would believe that generalization comes from minimum-norm solutions in some sense, but whether that corresponds to minimum-norm weights or not is a different question, and one you probably won’t know a priori (not to mention even knowing which norm to choose).
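
A small sketch of the "which norm" point, under assumed toy numbers: with one linear constraint `w . x == y`, the minimum-L2 and minimum-L1 solutions have simple closed forms, both fit the training point exactly, and they still disagree on a new input. The helper names and the data are hypothetical.

```python
def min_l2_solution(x, y):
    """Smallest Euclidean-norm w satisfying w . x == y (single constraint)."""
    scale = y / sum(xi * xi for xi in x)
    return [scale * xi for xi in x]


def min_l1_solution(x, y):
    """A smallest L1-norm w satisfying w . x == y: all weight on the largest |x_i|."""
    k = max(range(len(x)), key=lambda i: abs(x[i]))
    return [y / x[k] if i == k else 0.0 for i in range(len(x))]


def predict(w, x):
    """Linear model output w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


x_train, y_train = [1.0, 2.0], 2.0
w_l2 = min_l2_solution(x_train, y_train)
w_l1 = min_l1_solution(x_train, y_train)

x_test = [1.0, 0.0]
print("L2 weights:", w_l2, "test prediction:", predict(w_l2, x_test))
print("L1 weights:", w_l1, "test prediction:", predict(w_l1, x_test))
```

Both weight vectors reproduce the training target, so the training data alone can't tell you which one generalizes; that depends on whether the true mechanism looks more like the dense L2 answer or the sparse L1 one.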