by cs702 1358 days ago

Thank you.

As you probably know, the big deal about double descent is that once sufficiently large AI models cross the so-called "interpolation threshold" in training, and get over the hump, they start generalizing better -- the opposite of overfitting. State-of-the-art performance in fact requires getting over the hump. As far as I can tell, you did not mention any of that explicitly anywhere in your post.

Also, all your plots show only the classical overfitting curve, not the actual curve we now see all the time with larger AI models like Transformers.

jaschasd 1358 days ago

It's true that I don't go into detail about double descent, though I do describe how increasing capacity often reduces overfitting.

I believe the figure labeled "Figure 1" illustrates what you are suggesting (despite being labeled Figure 1, it is actually at the bottom of the blog post, so maybe easy to miss).

cs702 1358 days ago

> It's true that I don't go into detail about double descent, though I do describe how increasing capacity often reduces overfitting.

I agree.

> I believe the figure labeled "Figure 1" illustrates what you are suggesting (despite being labeled Figure 1, it is actually at the bottom of the blog post, so maybe easy to miss).

Easy to miss, yes. I'm not sure it illustrates the phenomenon, though. That plot shows extreme overfitting (i.e., interpolation) by the 10,000-parameter model. No one really understands what actually happens after interpolation. There is in fact some anecdotal evidence that, after crossing the interpolation threshold, large AI models trained with SGD gradually begin to ignore outliers and find simpler models (!) that generalize better (!). Counterintuitive, I know. This is an active area of research, with no good explanations yet, AFAIK.

jaschasd 1358 days ago

(the plot shows extreme overfitting with a 10-parameter model, and interpolation with a 10,000-parameter model)

cs702 1358 days ago

Interpolation == extreme overfitting.

The double descent phenomenon is what happens after interpolation.

RESPONDING TO YOUR LAST COMMENT (after reaching thread depth limit):

Think of it this way: Why and how does the model's performance continue to improve on previously unseen samples after the model has fully overfit (interpolated between) all training samples? Interpolation is not the end-point in training, but a temporary threshold after which models learn to generalize better, improving on interpolation. How is it that these models improve on interpolation?
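
For anyone who wants to poke at the question numerically, here is a minimal sketch of a much simpler setting than SGD-trained neural nets: minimum-norm least squares on Gaussian features, sweeping the number of features `p` past the number of training samples. Everything in it (the function names `min_norm_fit` and `sweep_widths`, the data-generating process, the noise level) is an assumption made for illustration, not anything from the blog post. Once `p` exceeds the number of training samples the fit interpolates the training set, so the only thing left to watch is how the test error behaves beyond that point. In toy setups like this the test error often peaks near `p == n_train` and falls again afterwards, but the exact numbers depend on the seed and the noise level.

```python
import numpy as np

def min_norm_fit(features, targets):
    """Least-squares fit via the pseudoinverse. With more columns than rows,
    this returns the interpolating solution with the smallest L2 norm."""
    return np.linalg.pinv(features) @ targets

def sweep_widths(n_train=20, n_test=500, n_total=200,
                 widths=(5, 10, 20, 40, 200), noise=0.5, seed=0):
    """Fit models that see only the first p features, for each p in widths.
    Returns a list of (p, train_mse, test_mse) tuples."""
    rng = np.random.default_rng(seed)
    # Hypothetical ground truth: a linear signal spread over all n_total features.
    beta = rng.normal(size=n_total) / np.sqrt(n_total)
    x_train = rng.normal(size=(n_train, n_total))
    x_test = rng.normal(size=(n_test, n_total))
    y_train = x_train @ beta + noise * rng.normal(size=n_train)  # noisy labels
    y_test = x_test @ beta                                       # clean targets
    rows = []
    for p in widths:
        coef = min_norm_fit(x_train[:, :p], y_train)
        train_mse = float(np.mean((x_train[:, :p] @ coef - y_train) ** 2))
        test_mse = float(np.mean((x_test[:, :p] @ coef - y_test) ** 2))
        rows.append((p, train_mse, test_mse))
    return rows

if __name__ == "__main__":
    for p, train_mse, test_mse in sweep_widths():
        print(f"p={p:4d}  train MSE={train_mse:.3e}  test MSE={test_mse:.3e}")
```

The pseudoinverse stands in for the implicit bias people attribute to SGD: among all solutions that fit the training data exactly, it picks the smallest-norm one. That is a swapped-in simplification, and it says nothing definitive about what large Transformers do after interpolation.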

jaschasd 1358 days ago

I can't reply directly -- is there a maximum thread depth, or a maximum conversation depth?

Anyway -- I wanted to apologize for misreading -- I missed the parenthetical "interpolation" in your comment. I think we are both interpreting the plot the same way.

In terms of your comment about anecdotal evidence -- are you talking about the case where data and model size are increased jointly? If so, I agree, though I don't think that case is cleanly about double descent/overparameterization anymore.
