|
|
|
|
|
by djha-skin
1037 days ago
|
|
How is this even a shock. Anyone who so much as taken a class on this knows that even the simplest of perceptron networks, decision trees, or any form of machine learning model generalizes. That's why we use them. If they don't, it's called overfit[1], where the model is so accurate on the training data that its inferential ability on new data suffers. I know that the article might be talking about a higher form of generalization with LLMs or whatever, but I don't see why the same principle of "don't overfit the data" wouldn't apply to that situation. No, really: what part of their base argument is novel? 1: https://en.wikipedia.org/wiki/Overfitting |
|
Simple models predicting simple things will generally slowly overfit, and regularization keeps that overfitting in check.
This "grokking" phenomenon is when a model first starts by aggressively overfitting, then gradually prunes unnecessary weights until it suddenly converges on the one generalizable combination of weights (as it's the only one that both solves the training data and minimizes weights).
Why is this interesting? Because you could argue that this justifies using overparametrized models with high levels of regularization; e.g. models that will tend to aggressively overfit, but over time might converge to a better solution by gradual pruning of weights. The traditional approach is not to do this, but rather to use a simpler model (which would initially generalize better, but due to its simplicity might not be able to learn the underlying mechanism and reach higher accuracy).