Hacker News
by djha-skin 1037 days ago
How is this even a shock.

Anyone who has so much as taken a class on this knows that even the simplest perceptron network, decision tree, or any other form of machine learning model generalizes. That's why we use them. When they don't, it's called overfitting[1]: the model is so accurate on the training data that its inferential ability on new data suffers.

I know that the article might be talking about a higher form of generalization with LLMs or whatever, but I don't see why the same principle of "don't overfit the data" wouldn't apply to that situation.

No, really: what part of their base argument is novel?

1: https://en.wikipedia.org/wiki/Overfitting
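
To make the overfitting point concrete, here is a minimal sketch in plain Python. The data, the noise values, and the function names are all made up for illustration; nothing here comes from the article. A degree-5 polynomial that passes through every noisy training point is compared with a least-squares line on one held-out point just outside the training range.

```python
# Toy data: the "true" rule is y = 2x + 1. The noise values are invented for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.3, -0.4, 0.5, -0.3, 0.4, -0.5]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]


def interpolant(x):
    """Degree-5 Lagrange polynomial through every training point (it memorizes the noise)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line():
    """Ordinary least-squares line: too simple to memorize the noise."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line()


def line(x):
    return slope * x + intercept


def max_train_error(model):
    """Largest absolute error on the training points."""
    return max(abs(model(x) - y) for x, y in zip(xs, ys))


# Held-out, noise-free point just past the training range.
x_new, y_new = 6.0, 2 * 6.0 + 1

print("train error    (interpolant, line):", max_train_error(interpolant), max_train_error(line))
print("held-out error (interpolant, line):", abs(interpolant(x_new) - y_new), abs(line(x_new) - y_new))
```

The interpolant has zero training error and a far larger held-out error than the line, which is the textbook picture of overfitting. Evaluating just outside the training range exaggerates the gap, so treat the size of the difference as illustrative only.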

3 comments

The interesting part is the sudden generalization.

Simple models predicting simple things will generally slowly overfit, and regularization keeps that overfitting in check.

This "grokking" phenomenon is when a model first starts by aggressively overfitting, then gradually prunes unnecessary weights until it suddenly converges on the one generalizable combination of weights (as it's the only one that both solves the training data and minimizes weights).
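
A toy sketch of the "solves the training data and minimizes weights" part, in plain Python. This is not a reproduction of grokking and not the article's setup; the single training example, the starting weights, the learning rate, and the weight-decay value are all assumptions picked for illustration. With two weights and one example, infinitely many weight vectors fit the data perfectly. Plain gradient descent stays wherever it first fits, while a small weight decay slowly drifts toward the minimum-norm fit.

```python
def train(w, x, y, lr=0.1, weight_decay=0.0, steps=20000):
    """Gradient descent on 0.5*(w.x - y)^2 + 0.5*weight_decay*|w|^2."""
    w = list(w)
    for _ in range(steps):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * (err * xi + weight_decay * wi) for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


# One training example, two weights: an overparametrized toy problem.
x_train, y_train = (1.0, 1.0), 2.0
# Starting point that already fits the example exactly (3 - 1 = 2),
# standing in for a "memorizing" solution with large weights.
w_start = (3.0, -1.0)

w_plain = train(w_start, x_train, y_train)                     # no regularization
w_decay = train(w_start, x_train, y_train, weight_decay=0.01)  # small weight decay

# Held-out input; the target is 1.0 if the "true" weights are (1, 1).
x_test = (1.0, 0.0)

print("no decay:  ", w_plain, "-> prediction", predict(w_plain, x_test))
print("with decay:", w_decay, "-> prediction", predict(w_decay, x_test))
```

Without decay the weights never move from (3, -1), since the training loss is already zero there. With decay they end up close to (1, 1), the smallest-norm weights that (almost) fit the example, and the drift takes thousands of steps because the decay term is small. The held-out prediction only improves here because the toy "true" weights were chosen to be the minimum-norm ones.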

Why is this interesting? Because you could argue that this justifies using overparametrized models with high levels of regularization, i.e. models that will tend to aggressively overfit but over time might converge to a better solution through gradual pruning of weights. The traditional approach is not to do this, but rather to use a simpler model (which would initially generalize better, but due to its simplicity might not be able to learn the underlying mechanism and reach higher accuracy).

It's interesting that the researchers chose example problems where the minimum norm solution is the best at generalization. What if that's not the case?

Yeah, this is what’s really going on here, and it feels like it’s been shrouded in language to make it seem more grandiose. That being said, I would believe generalization to occur from minimum norm solutions in some sense, but whether that corresponds to minimum norm weights or not is a different question, and one you probably won’t know a priori (not to mention even knowing which norm to choose).

There are so many idiots in the AI space that are completely ignorant of how Machine Learning works. The worst are the grifters that fearmonger about AI safety by regurgitating singularity memes.

It's because you overgeneralized your simple understanding. There is a lot more nuance to the thing you are calling overfitting (and underfitting). We do not know why or when it happens in all cases. We do know some cases where it happens and why, but that doesn't mean we know the others. There is still a lot of interpretation needed. How much was overfit? How much underfit? Can these happen at the same time? (Yes.) Which layers do this, what causes it, and how can we avoid it? Reading the article shows you that this is far from a trivial task. This is all before we even introduce the concept of sudden generalization. Once we do, all these questions start again, but in a completely different context that is even more surprising. We also need to talk about new aspects like the rate of generalization and the rate of memorization, and what affects them.

tldr: don't oversimplify things: you underfit

P.S. please don't fucking review. Your complaints aren't critiques.