by halflings
1038 days ago
The interesting part is the sudden generalization. Simple models predicting simple things will generally overfit slowly, and regularization keeps that overfitting in check. "Grokking" is when a model starts by aggressively overfitting, then gradually prunes unnecessary weights until it suddenly converges on the one generalizable combination of weights (the only one that both fits the training data and keeps the weights small). Why is this interesting? Because you could argue that this justifies using overparameterized models with heavy regularization, i.e. models that will tend to overfit aggressively at first but may converge to a better solution over time as weights are gradually pruned. The traditional approach is to avoid this and use a simpler model instead, which would generalize better initially but, because of its simplicity, might not be able to learn the underlying mechanism and reach higher accuracy.
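To make the "fits the training data and keeps the weights small" argument concrete, here is a toy sketch, not a reproduction of grokking itself. Everything in it is an assumption chosen for illustration: a two-weight linear model, a single training point, a made-up true rule `y = x1 + x2`, and a hypothetical `train` function. It starts from a large-norm solution that already fits the training point, and weight decay then slides it toward the small-norm solution that also generalizes. The drift here is smooth, so it does not show the sudden transition.

```python
def train(steps=10000, lr=0.1, weight_decay=0.01):
    # One training example: x = (1, 1), y = 2.
    # Any weights with w1 + w2 = 2 fit it exactly, so the model is overparameterized.
    x, y = (1.0, 1.0), 2.0
    # Held-out example drawn from the assumed true rule y = x1 + x2.
    x_test, y_test = (1.0, 0.0), 1.0
    # Start at a large-norm solution that already fits the training point ("overfit").
    w = [5.0, -3.0]
    history = []
    for step in range(steps + 1):
        err = w[0] * x[0] + w[1] * x[1] - y
        if step % 2500 == 0:
            test_err = (w[0] * x_test[0] + w[1] * x_test[1] - y_test) ** 2
            norm_sq = w[0] ** 2 + w[1] ** 2
            history.append((step, err ** 2, test_err, norm_sq))
        # Gradient step on 0.5 * err^2 + 0.5 * weight_decay * ||w||^2.
        w = [wi - lr * (err * xi + weight_decay * wi) for wi, xi in zip(w, x)]
    return w, history


if __name__ == "__main__":
    final_w, history = train()
    for step, train_err, test_err, norm_sq in history:
        print(f"step={step:5d} train={train_err:.5f} test={test_err:.5f} |w|^2={norm_sq:.3f}")
    print("final weights:", final_w)
```

Training error stays near zero the whole time. What changes is the weight norm, and the held-out error falls with it as the weights approach roughly (1, 1).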