Hacker News
by jhanschoo 43 days ago
My intuitive understanding of double descent is this:

1. Older ML models encoded a bias toward simplicity in their architecture and limited expressivity, which aided interpolation.

2. Overparameterized models instead use regularization to nudge parameters toward simpler and more robust representations, while still memorizing the noise. In this manner, we still get good generalization performance OOD. Moreover, the softer nudging and the architecture's fundamental expressivity allow for "data-specific" generalizations and representations that may be impossible to represent in small models.

3. At the critical point between the two regimes, the model is expressive enough to memorize, but not expressive enough to both memorize and encode general patterns.
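
For anyone who wants to poke at the three regimes above, here is a rough toy sketch, not anything from the paper. Assumptions on my part: a random ReLU feature model with only the linear readout trained, a noisy linear target, and the minimum-norm least-squares solution standing in for "regularization" (it is an implicit preference for small weights, not an explicit penalty). The names `random_relu_features` and `sweep` and all the sizes are made up for illustration.

```python
import numpy as np


def random_relu_features(X, W, b):
    """Fixed random first layer; only the linear readout on top is trained."""
    return np.maximum(X @ W + b, 0.0)


def sweep(widths, n_train=40, n_test=500, d=5, noise=0.5, trials=20, seed=0):
    """Average train/test MSE over several trials for each feature count p."""
    rng = np.random.default_rng(seed)
    train_mse = {p: 0.0 for p in widths}
    test_mse = {p: 0.0 for p in widths}
    for _ in range(trials):
        # Assumed toy task: linear target, label noise on the training set only.
        w_true = rng.normal(size=d)
        X_tr = rng.normal(size=(n_train, d))
        X_te = rng.normal(size=(n_test, d))
        y_tr = X_tr @ w_true + noise * rng.normal(size=n_train)
        y_te = X_te @ w_true
        for p in widths:
            W = rng.normal(size=(d, p)) / np.sqrt(d)
            b = rng.normal(size=p)
            F_tr = random_relu_features(X_tr, W, b)
            F_te = random_relu_features(X_te, W, b)
            # For p > n_train, lstsq returns the minimum-norm interpolating
            # solution; this is the implicit "nudge toward simplicity" here.
            beta, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
            train_mse[p] += np.mean((F_tr @ beta - y_tr) ** 2) / trials
            test_mse[p] += np.mean((F_te @ beta - y_te) ** 2) / trials
    return train_mse, test_mse


if __name__ == "__main__":
    widths = [5, 10, 20, 40, 80, 400]  # 40 == n_train is the critical point
    train_mse, test_mse = sweep(widths)
    for p in widths:
        print(f"p={p:4d}  train={train_mse[p]:.4f}  test={test_mse[p]:.4f}")
```

In a setup like this I would expect the small models to be unable to fit the noise (nonzero train error), the test error to spike around p = n_train, and the wide models to reach near-zero train error while the test error comes back down. It says nothing about deep networks or explicit regularizers; it only makes the shape of the curve easy to see.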

I wonder how this understanding translates to these researchers' models of deep learning.