by hodgehog11 54 days ago
Hastie was actually lead author of an excellent paper that discusses the underlying phenomenon in the context of least-squares linear regression: https://arxiv.org/abs/1903.08560

It really isn't so mysterious once you examine where the rule of thumb for the bias-variance tradeoff came from (remember that it is the relationship with model size that is curious, not the tradeoff itself). The easiest way to arrive at this rule is through an information criterion like the AIC or BIC, where the model size appears in the penalty term added to the log-likelihood. These criteria rest on a bunch of assumptions, all of which are crucial, and absolutely none of which apply to neural networks. The biggest one is that the only quantity taken to infinity is the size of the dataset, so there is vastly more data than there are model parameters. Neural networks, by contrast, have parameter counts within a constant ratio of the number of datapoints. Another is that the model has a non-singular Hessian in a neighbourhood of the optimum. Neural networks do not have this. Once you abandon the rule of thumb and actually do the math in the appropriate limiting regimes, there's no contradiction anymore.
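
For reference, these are the standard textbook forms of the two criteria. The symbols are assumptions of this note, not notation from the paper: $k$ is the number of fitted parameters, $n$ the number of datapoints, and $\hat{L}$ the maximized likelihood. In both, the model size $k$ enters only through the penalty term, which is where the "bigger model, more variance" rule of thumb comes from.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Both are derived for fixed $k$ as $n \to \infty$, which is exactly the regime that fails when $k$ grows along with $n$.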

I've found the biggest mystery for people though is the fact that performance actually _improves_ after the interpolation threshold. This seems insane if you come at it from the point of view that the model "could have done anything" when there are more parameters than data. But this isn't true at all. The fact that you have obtained _a solution_ means that you imposed some implicit bias that guided which solution you ended up at. For linear regression fit by gradient descent, that is often the minimum L2 norm solution, which _literally_ minimizes the variance keeping all else fixed. If you add more parameters to play with, obviously it should be able to minimize the variance even further, right? If the bias is zero and the variance is reduced, you get better performance. If you use a different optimizer than gradient descent, you can end up at the minimum L1 norm solution (effectively LASSO), which is well known to perform really well regardless of the number of parameters.
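
A minimal NumPy sketch of the minimum L2 norm point, under assumptions that are mine rather than from the paper: random Gaussian features, a toy target, and `np.linalg.pinv` standing in for gradient descent (for underdetermined least squares, gradient descent started from zero converges to the same minimum-norm interpolant). It checks two things: any other interpolating solution has a larger norm, and the minimum norm does not grow as more features are added.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_max = 20, 200                      # 20 datapoints, up to 200 features (toy sizes)
X_full = rng.normal(size=(n, p_max))    # hypothetical random design matrix
y = X_full[:, :5].sum(axis=1) + 0.1 * rng.normal(size=n)  # toy target with noise


def min_norm_fit(X, y):
    """Least-squares solution of minimum L2 norm, via the pseudoinverse."""
    return np.linalg.pinv(X) @ y


# Past the interpolation threshold (p > n), fit using the first p features.
norms = {}
for p in (25, 50, 100, 200):
    X = X_full[:, :p]
    w = min_norm_fit(X, y)
    assert np.allclose(X @ w, y)        # the fit interpolates the training data
    norms[p] = float(np.linalg.norm(w))

# Build a different interpolating solution by adding a null-space direction.
X = X_full[:, :50]
w = min_norm_fit(X, y)
v = rng.normal(size=50)
v_null = v - np.linalg.pinv(X) @ (X @ v)  # component of v with X @ v_null = 0
w_other = w + v_null
assert np.allclose(X @ w_other, y)        # still fits the data exactly

print("min-norm solution norm by feature count:", norms)
print("norm of min-norm fit:", np.linalg.norm(w))
print("norm of alternative interpolant:", np.linalg.norm(w_other))
```

The norm cannot increase with `p` here because the features are nested: the solution for a smaller `p`, padded with zeros, is still an interpolant for the larger problem, so the minimum over the larger set is at most as big.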

Of course, linear regression is not neural network regression, and the situation in deep learning is far more complicated. But the same idea applies. Every single part of the training procedure is carefully designed to bias the obtained solution toward something with minimal variance. Stochastic optimizers (even dropout) settle in wide minima, which have smaller variances. Some optimizers prioritize stronger correlations in the weights. Bottlenecks in the architecture induce low-rank solutions. Data augmentation induces known invariances that reduce variance along those directions. Convolutional designs induce regularity with respect to the input space. Neural networks are not magic; they are the product of hundreds of intentional design decisions over decades. When you increase the size of the model, all of these effects are amplified.

Quantifying all of this in the theory is difficult because there are a lot of moving parts. But if you study a simplified model and consider each mechanism individually, the picture becomes pretty clear.