HN Mirror

by baron_harkonnen 1729 days ago

> In practice, these huge models are, in layman's terms, fucking awesome and work really well

A similarly surprising result from an adjacent community, Bayesian statistics, is that in the case of hierarchical models, increasing your number of parameters can paradoxically reduce overfitting.

The scale of parameters in Bayesian models is nowhere near that of these deep neural nets, but nonetheless this is a similarly shocking result, since adding parameters is typically penalized when model building.

It's a bit more explainable in Bayesian stats, since some of the parameters are there to limit the impact of the more granular parameters (i.e. you're learning a prior probability distribution for the other parameters, which prevents extreme overfitting in cases with less information).
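To make that concrete, here is a rough sketch of partial pooling in a normal-normal setup. Everything in it is an assumption for illustration: the group counts, the noise levels, and the use of a simple empirical-Bayes (method of moments) estimate instead of a full Bayesian fit. The "extra" parameters are `mu_hat` and `tau2_hat`, which describe the prior over the per-group effects and pull noisy group means toward the overall mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 groups with only 3 noisy observations each.
n_groups, n_per_group = 50, 3
sigma = 5.0  # within-group noise sd (assumed known, for simplicity)
tau = 1.0    # true between-group sd (used only to simulate data)

true_effects = rng.normal(0.0, tau, size=n_groups)
noise = rng.normal(0.0, sigma, size=(n_groups, n_per_group))
data = true_effects[:, None] + noise

# No pooling: one free parameter per group, fit from that group alone.
raw = data.mean(axis=1)

# Partial pooling: two additional hyperparameters (mu, tau^2) describe the
# prior over group effects and are estimated from all groups together.
se2 = sigma**2 / n_per_group                 # sampling variance of a group mean
mu_hat = raw.mean()
tau2_hat = max(raw.var(ddof=1) - se2, 0.0)   # method-of-moments estimate

# Posterior mean of each group effect: a weighted average of the group's own
# mean and the overall mean. Less information per group means more shrinkage.
weight = tau2_hat / (tau2_hat + se2)
pooled = weight * raw + (1.0 - weight) * mu_hat

raw_mse = np.mean((raw - true_effects) ** 2)
pooled_mse = np.mean((pooled - true_effects) ** 2)
print(f"no pooling MSE:      {raw_mse:.2f}")
print(f"partial pooling MSE: {pooled_mse:.2f}")
```

The model with more parameters ends up with the smaller error against the true effects, because the hyperparameters act as a learned regularizer on the group-level ones.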

I wouldn't be too surprised if we eventually realized there was a similar cause preventing overfitting in overparameterized ML models.

4 comments

naomisperfume 1729 days ago

Do you have any good references of this phenomenon in hierarchical models?

bigfudge 1728 days ago

Gelman and Hill has a brief intro and some references.

sjg007 1729 days ago

This is likely part of the reason. The only problem is that said models require a lot of data, whereas humans can learn from a very small number of examples.

sdenton4 1729 days ago

Humans are continuously pretrained on a variety of tasks, though. Teaching a kid to say one word takes about a year...

gota 1728 days ago

And the very same system that is being trained to say a word is also being trained to recognize intent from intonation all the while. As soon as it can say it, the child will likely use that one word with different tones to successfully mean different things.

We are insanely complex machines...

oleg_myrk 1729 days ago

Any chance you could share a link to a relevant paper?

talolard 1729 days ago

I don't know if it's correct, but I often think of a classification model as learning the parameters of a Dirichlet distribution, with the final softmax layer being a sample from it.
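A tiny sketch of why the analogy is at least plausible: a softmax output and a Dirichlet draw are both points on the probability simplex (non-negative, summing to 1). The logits and concentration parameters below are made up, and this does not show that a trained classifier actually behaves like a Dirichlet sampler.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

rng = np.random.default_rng(0)

# Hypothetical final-layer logits for a 3-class classifier.
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)

# Hypothetical Dirichlet concentration parameters for the same 3 classes.
alpha = np.array([5.0, 2.0, 1.0])
sample = rng.dirichlet(alpha)

# Both vectors are valid categorical distributions over the classes.
print("softmax output:  ", probs, probs.sum())
print("Dirichlet sample:", sample, sample.sum())
```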
