by baron_harkonnen 1729 days ago
> In practice, these huge models are, in layman's terms, fucking awesome and work really well

A similarly surprising result from an adjacent community, Bayesian Statistics, is that in the case of hierarchical models, increasing your number of parameters can paradoxically reduce overfitting.

The scale of parameters in Bayesian models is nowhere near that of these deep neural nets, but nonetheless this is a similarly shocking result, since adding parameters is typically penalized when model building.

It's a bit more explainable in Bayesian stats since what you're using some parameters for is limiting the impact of more granular parameters (i.e. you're learning a prior probability distribution for the other parameters, which prevents extreme overfitting in cases with less information).
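To make the shrinkage idea concrete, here is a minimal sketch in Python with NumPy. It is not a full Bayesian fit: it uses a simple empirical-Bayes (method-of-moments) estimate of the shared prior, and the function name, group counts, and variances are all made up for illustration. The extra "hyper" parameters (the grand mean and between-group variance) are what pull noisy per-group estimates toward each other.

```python
import numpy as np

def partial_pooling_demo(n_groups=500, n_per_group=3, tau=1.0, sigma=3.0, seed=0):
    """Compare per-group means (no pooling) with shrunken means (partial pooling).

    All names and default values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Simulate a hierarchical model: each group mean is drawn from a shared prior.
    true_means = rng.normal(0.0, tau, size=n_groups)
    data = rng.normal(true_means[:, None], sigma, size=(n_groups, n_per_group))

    # No pooling: one free parameter per group, estimated from that group alone.
    group_means = data.mean(axis=1)

    # Estimate the shared prior from the data (empirical Bayes, method of moments).
    grand_mean = group_means.mean()
    within_var = data.var(axis=1, ddof=1).mean()    # noise variance per observation
    sampling_var = within_var / n_per_group         # noise variance of a group mean
    between_var = max(group_means.var(ddof=1) - sampling_var, 1e-12)

    # Partial pooling: shrink each group mean toward the grand mean.
    # The weight is small when groups are noisy relative to how much they differ.
    weight = between_var / (between_var + sampling_var)
    shrunk_means = grand_mean + weight * (group_means - grand_mean)

    mse_none = float(np.mean((group_means - true_means) ** 2))
    mse_partial = float(np.mean((shrunk_means - true_means) ** 2))
    return mse_none, mse_partial, float(weight)

if __name__ == "__main__":
    mse_none, mse_partial, weight = partial_pooling_demo()
    print(f"no pooling MSE:      {mse_none:.3f}")
    print(f"partial pooling MSE: {mse_partial:.3f}")
    print(f"shrinkage weight:    {weight:.3f}")
```

With only a few observations per group, the shrunken estimates should land closer to the true group means than the raw per-group means do, even though the hierarchical version has strictly more parameters.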

I wouldn't be too surprised if we eventually realized that a similar mechanism prevents overfitting in overparameterized ML models.

4 comments

Do you have any good references of this phenomenon in hierarchical models?
Gelman and Hill has a brief intro and some references.
This is likely part of the reason. The only problem is that said models require a lot of data, whereas humans can learn from a very small number of examples.
Humans are continuously pretrained on a variety of tasks, though. Teaching a kid to say one word takes about a year...
And the very same system that is being trained to say a word is also being trained to recognize intent from intonation all the while. As soon as it can say it, the child will likely use that one word with different tones to successfully mean different things.

We are insanely complex machines...

Any chance you could share a link to a relevant paper?
I don’t know if it’s correct, but I often think of a classification model as learning the parameters of a Dirichlet distribution, with the output of the final softmax layer being a sample from it.