by bsmith89 1734 days ago
I'm not sure if it's related, but I've seen discussions of modern ML methods (in particular those trained using stochastic algorithms...maybe also models with low float precision...?) approximating Bayesian methods. The way I've imagined it is that the training path, by virtue of its stochasticity, resembles MCMC sampling and therefore tends to end up in regions of high posterior volume (the "typical set"), rather than high posterior density. I could see this resulting in a fit with parameters closer to their conditional expectations (in the Bayesian sense), which should be more generalizable to new data, hence fewer issues with overfitting.
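
To make the density-versus-volume distinction concrete, here is a toy sketch (not taken from any paper; the standard normal, the sample count, and the name `typical_radius` are all assumptions chosen for illustration). For a standard normal in many dimensions the density peaks at the origin, yet samples almost never land near it; they concentrate at a distance of roughly sqrt(dim) from the mode, which is where most of the probability mass sits.

```python
import numpy as np

def typical_radius(dim, n_samples=2_000, seed=0):
    """Mean distance from the mode (the origin) for standard normal samples.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal((n_samples, dim))
    # Euclidean norm of each sample, i.e. its distance from the density peak.
    return np.linalg.norm(samples, axis=1).mean()

# Compare the average sample radius with sqrt(dim) as the dimension grows.
for dim in (1, 10, 100, 1000):
    print(dim, round(float(typical_radius(dim)), 2), round(float(np.sqrt(dim)), 2))
```

If the analogy between a stochastic training path and MCMC holds at all, this is the kind of region the path would spend its time in, rather than at the single highest-density point.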

A consequence of this would be that if somehow a method were able to successfully find the _global_ loss-function minimum on the training data, it would perform worse on the test set. Fortunately, our optimization methods _don't_ find the global minimum at all.

Can anybody point me to literature on this idea? I don't know if my uninformed interpretation is actually close to what experts are thinking.

2 comments

You're looking for flat minima / wide basins. (Amusingly, this one actually does go back to Schmidhuber etc.) Explains a lot of phenomena like poorer generalization of second-order optimizers, SGD sometimes working surprisingly better, stochastic weight averaging / EMA, grokking, or patient teachers.
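
A rough way to see why width might matter, as a toy sketch under loud assumptions: two one-dimensional quadratic basins with the same minimum loss but different curvature, and random parameter jitter standing in for the mismatch between training and test loss. The curvatures, the noise scale, and the names `loss` and `perturbed_loss` are made up for illustration and are not from the literature mentioned above.

```python
import numpy as np

def loss(w, center, curvature):
    """Quadratic basin whose minimum value is 0 at `center`."""
    return 0.5 * curvature * (w - center) ** 2

def perturbed_loss(center, curvature, noise_scale=0.1, n_samples=100_000, seed=0):
    """Average loss when the parameter is jittered around the basin minimum.

    The jitter is an assumed stand-in for train/test mismatch.
    """
    rng = np.random.default_rng(seed)
    w = center + noise_scale * rng.standard_normal(n_samples)
    return float(loss(w, center, curvature).mean())

# Both basins reach exactly zero loss at their minimum; only the width differs.
sharp = perturbed_loss(center=-1.0, curvature=100.0)
flat = perturbed_loss(center=1.0, curvature=1.0)
print(f"sharp basin: {sharp:.4f}, flat basin: {flat:.4f}")
```

Both minima look equally good on the unperturbed loss, but the same small jitter costs far more in the sharp basin, which is the intuition behind preferring wide basins.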
Oh, and I enjoyed reading this primer on the Double Descent Phenomenon for anybody, like me, who hadn't heard of it before: https://openai.com/blog/deep-double-descent/