by gccs 1766 days ago
"We trained a model, overfit on 1.4 trillion accident scenarios, and it behaved correctly on all of them."
3 comments

Unless your model actually has trillions of parameters (and it doesn't; even GPT-3 only has 175 billion), it is not even possible to overfit on 1.4 trillion training inputs. You can't actually pigeonhole it.
Suppose that you train a neural network to predict the next number in an arithmetic sequence (a, a+b, a+2b, a+3b, a+4b, ...). As input it gets two numbers, the previous number and the current number, and has to predict the next one.

Suppose you had 1.4 trillion examples in the following test set (using a model with 175 billion parameters):

(1,2)->3

(2,3)->4

(3,4)->5

...

Do you think it is possible to overfit and score perfect on the test set, while failing to generalize?

I think you've specified this problem in a very strange way. But if you're saying that you're trying to train on the specific dataset where a = 1 and b = 1, then your model will fit the data perfectly with 175 billion parameters. It will also fit the data perfectly with, like, 15 parameters.

If you're trying to fit to some more complex space where a and b are unknown and you're given 3 numbers in the sequence, then what you're trying to fit is `f(a, b) = a + 2(b - a)` (or 2b - a, however you want to represent it), which is a swell function, but if you only give data that can be equally represented by `f(a, b) = b + 1`, you're mis-training your model.

But you could once again do that with a model with a dozen parameters. In both cases, the issue isn't overfitting, but misrepresentative data.
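To make that concrete, here is a rough sketch (not anyone's actual model) of the two rules discussed above, scored on a truncated version of the test set. The names `arithmetic_next`, `plus_one`, `accuracy`, and `held_out` are made up for illustration, and the step-3 sequence is an assumed example of data the test set never exercises.

```python
# Two candidate rules for "predict the next term from the last two terms".
def arithmetic_next(prev, cur):
    # Arithmetic-sequence rule: next = cur + (cur - prev) = 2*cur - prev
    return 2 * cur - prev

def plus_one(prev, cur):
    # Shortcut rule that ignores prev entirely
    return cur + 1

def accuracy(rule, data):
    # Fraction of (inputs, target) rows the rule predicts exactly
    return sum(rule(*x) == y for x, y in data) / len(data)

# Test set from the thread: (1,2)->3, (2,3)->4, ... (truncated to 1000 rows)
test_set = [((n, n + 1), n + 2) for n in range(1, 1001)]

# Hypothetical sequences with step 3, which the test set never covers
held_out = [((n, n + 3), n + 6) for n in range(1, 1001)]

acc_arith = accuracy(arithmetic_next, test_set)
acc_plus = accuracy(plus_one, test_set)
acc_arith_held = accuracy(arithmetic_next, held_out)
acc_plus_held = accuracy(plus_one, held_out)

print(acc_arith, acc_plus)            # both rules are perfect on the test set
print(acc_arith_held, acc_plus_held)  # only the arithmetic rule survives step 3
```

Both rules are indistinguishable on the step-1 data, and neither needs more than a couple of parameters.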

I didn't specify the training set, just the test set. It's possible that your model actually models an arithmetic series. Or that it simply overfits. The point is that it doesn't require trillions of parameters to overfit to a trillion-sized test set.
What you need are more parameters than the complexity of the underlying distribution. If the function you're modelling is linear, you only need a couple of parameters.

"Overfitting" is memorizing the training data instead of generalizing. The example you're providing isn't overfitting, it's just generalizing to the wrong function. Overfitting would be if the validation set was, say, 30 random values that you got right, but didn't get other values along the same lines correct.
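As a toy illustration of that distinction, here is a sketch of a pure memorizer: a lookup table standing in for a model that has overfit. Everything here (`true_rule`, `memorizer`, the seed, the value ranges, the fallback answer of 0) is an assumption picked for the example, not something from the thread.

```python
import random

def true_rule(prev, cur):
    # Next term of an arithmetic sequence given the last two terms
    return 2 * cur - prev

rng = random.Random(0)  # fixed seed so the run is repeatable

# 30 random training inputs (prev, cur) with prev in [-1000, 1000]
train_inputs = []
for _ in range(30):
    a = rng.randint(-1000, 1000)
    step = rng.randint(1, 50)
    train_inputs.append((a, a + step))

# The "model" simply stores the answer for every input it has seen
memorized = {x: true_rule(*x) for x in train_inputs}

def memorizer(prev, cur):
    # Returns the stored answer, or an arbitrary 0 for unseen inputs
    return memorized.get((prev, cur), 0)

def accuracy(rule, inputs):
    return sum(rule(*x) == true_rule(*x) for x in inputs) / len(inputs)

# Fresh inputs from the same family of sequences, outside the training range
fresh_inputs = [(a, a + 7) for a in range(2000, 2030)]

train_acc = accuracy(memorizer, train_inputs)
fresh_acc = accuracy(memorizer, fresh_inputs)
print(train_acc, fresh_acc)
```

The memorizer is perfect on the 30 values it saw and wrong on other values along the same lines, which is the failure being described here.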

> I didn't specify the training set, just the test set

Then unless you constructed the training set with the intent of mistraining the model, I think a model that got good accuracy on that validation set would generalize.

> The point is that it doesn't require trillions of parameters to overfit to a trillion-sized test set.

You can't "overfit" a validation set, unless you've done something wrong. Overfitting is, by definition, learning the training set too well such that you fail to generalize to a validation set.

Overfitting is, by definition, learning a model that doesn't generalize to the distribution of inputs you care about. If your validation set has the same distribution as the inputs you care about, then your definition holds. But that's definitely not true in practice. Usually the data you collect won't be exactly representative of the conditions you're looking to test, unless your problem is very simple.

That isn't overfit, that's fit. Nothing can protect you if your training set just doesn't have any indication of the thing you want it to learn.

I didn't specify that the training set wasn't representative.

All this shows is that you don't need parameters anywhere close to the number of test examples to overfit.

And my point is that is not what overfit is. Overfit is a specific problem where the network fails to recognize a commonality in the training set and instead interprets the irrelevant details of some subset of training samples (in the extreme, individual samples) as distinct properties.

Your example training set is not filled with noise that the network is picking up on to its detriment. Your example training set is simply not representative of the function you are trying to teach.

I don't have an example training set. I don't have an example model.

My exact point is that if your test set isn't representative of the underlying distribution, then accuracy on the test set doesn't mean that your model isn't overfit.

What if the 1.4 trillion accident scenarios were the test set?
k-fold cross validation
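For anyone unfamiliar, a minimal sketch of how k-fold splitting works, in plain Python. `k_fold_indices` is a made-up helper for illustration (libraries such as scikit-learn ship their own version), and the 10-sample, 3-fold numbers are arbitrary.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    # Each sample is held out exactly once across the k rounds
    print(len(train_idx), val_idx)
```

Each round trains on k-1 folds and validates on the remaining one. Note that every fold is still drawn from the same pool of collected scenarios, so this checks for memorization within that pool rather than for whether the pool itself is representative.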