| The parent comment refers to Double Descent in Human Learning [1, 2], which in turn references the Deep Double Descent phenomenon in large models [3]. The gist of double descent is that a _large_ model first appears to over-fit, as we would expect from traditional ML, but then the validation loss starts decreasing again. The general consensus is that the model switches from a memorization regime to an interpolation regime, which enables generalization. In essence, the large number of parameters initially fits the training data directly, and the fit then smooths out. The reason we can keep improving the model is that over-parameterized models always have a descent direction, simply due to the dimensionality of the model. The paper that the post references shows that by trading parameters for more training steps, small models can exhibit similar, if not identical, behaviour. [1] https://news.ycombinator.com/item?id=35683754 [2] https://chris-said.io/2023/04/21/double-descent-in-human-lea... [3] https://openai.com/research/deep-double-descent |
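For anyone who wants to see the shape of the curve without training a deep net, here is a minimal toy sketch. It is not the setup from [3] or from the paper the post references: it swaps in minimum-norm linear regression on synthetic Gaussian data, and every name and number in it (`double_descent_curve`, `widths`, the noise level, the sample sizes) is an assumption made for illustration. The model only sees the first `p` of 200 true features, so `p` plays the role of model size and `p = n_train` is the interpolation threshold.

```python
import numpy as np

def double_descent_curve(n_train=40, n_test=500, n_features=200,
                         widths=(5, 10, 20, 30, 40, 50, 80, 200),
                         noise=0.5, n_trials=20, seed=0):
    """Return {p: (mean train MSE, mean test MSE)} for min-norm least squares.

    Hypothetical toy setup: the target depends on all n_features inputs,
    but the model is only given the first p of them.
    """
    rng = np.random.default_rng(seed)
    totals = {p: [0.0, 0.0] for p in widths}
    for _ in range(n_trials):
        # Fresh ground-truth weights and data for every trial.
        beta = rng.normal(size=n_features) / np.sqrt(n_features)
        X_tr = rng.normal(size=(n_train, n_features))
        X_te = rng.normal(size=(n_test, n_features))
        y_tr = X_tr @ beta + noise * rng.normal(size=n_train)
        y_te = X_te @ beta + noise * rng.normal(size=n_test)
        for p in widths:
            # lstsq returns the minimum-norm solution when p > n_train,
            # i.e. the "smoothest" of the many weight vectors that
            # interpolate the training data.
            w, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
            totals[p][0] += np.mean((X_tr[:, :p] @ w - y_tr) ** 2) / n_trials
            totals[p][1] += np.mean((X_te[:, :p] @ w - y_te) ** 2) / n_trials
    return {p: (tr, te) for p, (tr, te) in totals.items()}

curve = double_descent_curve()
for p, (train_mse, test_mse) in curve.items():
    print(f"p={p:4d}  train MSE={train_mse:12.4f}  test MSE={test_mse:12.4f}")
```

In this toy the test error should rise as `p` approaches `n_train`, spike around `p = 40`, and come back down as `p` grows past it, while the training error drops to roughly zero from the threshold onward. It illustrates the model-size version of the effect only; the trade-off between parameters and training steps discussed above is not modelled here.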