by yldedly 1730 days ago
>In practice, these huge models are, in laymans terms, fucking awesome and work really well e.g. they generalize and work in production. No one understands why.

To add nuance to this, these models are awesome at interpolation, but not so much at extrapolation. Or in different terms, they generalize very well to an IID test set, but don't generalize under (even slight) distribution shift.

The main reason for this is that these models tend to solve classification and regression problems quite differently from how humans do. Broadly speaking, a large, flexible NN will find a "shortcut", i.e. a simple relation between some part of the input and the output, which may not be informative in the way we want, such as a watermark in the corner of an image, or statistical regularities in textures which disappear in slightly different lighting conditions. See e.g. https://thegradient.pub/shortcuts-neural-networks-love-to-ch...
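To make the shortcut idea concrete, here is a toy sketch in Python. It is an assumption-laden illustration, not an NN: the "model" is a one-feature decision stump that picks whichever feature best matches the label on the training set, and the dataset, the `shape` and `watermark` feature names, and the 80% figure are all made up. In training, the watermark agrees with the label perfectly, so the stump latches onto it. When the watermark disappears at test time, accuracy drops to chance.

```python
def accuracy(rows, feature):
    """Fraction of rows where predicting label = rows[feature] is correct."""
    return sum(row[feature] == row["label"] for row in rows) / len(rows)


def fit_stump(rows, features):
    """Pick the single feature with the best training accuracy."""
    return max(features, key=lambda f: accuracy(rows, f))


def make_rows(watermark_follows_label):
    """Hypothetical dataset: 'shape' is the real signal (right 8 times out of 10),
    'watermark' is a spurious feature that may or may not track the label."""
    rows = []
    for i in range(10):
        label = i % 2
        shape = label if i < 8 else 1 - label
        watermark = label if watermark_follows_label else 0
        rows.append({"shape": shape, "watermark": watermark, "label": label})
    return rows


train = make_rows(watermark_follows_label=True)    # watermark leaks the label
shifted = make_rows(watermark_follows_label=False)  # watermark is gone

chosen = fit_stump(train, ["shape", "watermark"])
print("feature chosen on training data:", chosen)
print("train accuracy:", accuracy(train, chosen))
print("accuracy under shift:", accuracy(shifted, chosen))
print("accuracy of 'shape' under shift:", accuracy(shifted, "shape"))
```

The learner isn't doing anything wrong by its own objective. The shortcut really is the best predictor on the data it saw, which is why more training data from the same distribution doesn't help.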

I think it's fair to say that these models are great when you have an enormous dataset that covers the entire domain, but sub-Google-scale problems are usually still solved by underparametrized models (even at Google).

2 comments

It depends. It really doesn’t take that much data to train a pretty stunning (if simple) character-level RNN “language model” that beats any n-gram model. The same goes for MNIST. ANNs really are a useful tool for a vast class of problems, many of which can be solved with comparatively little data.

Maybe your point stands, and it’s just that some domains need less data, just saying.

>ANNs really are a useful tool for a vast class of problems, many of which can be solved with comparatively little data.

For sure, it all depends on how robust the model needs to be, how strongly it needs to generalize. If your dataset covers the entire domain, you don't need a robust model. If you need strong generalization, then you need to build in stronger priors.

Take f(x) = x^2. If your model only needs to work in a finite interval, you just need a decent sample that covers that interval. But if it needs to generalize outside that interval, no number of parameters will give you good performance. Far enough outside the boundaries of the interval, the NN will be either approximately constant (with a sigmoid activation, which saturates) or linear (with ReLU-type activations, since the network is piecewise linear).
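A minimal sketch of the ReLU case, in plain Python. The weights here are set by hand rather than trained, which is an assumption made to keep the example deterministic: the one-hidden-layer network is built to interpolate x^2 at 21 knots on [-1, 1]. A trained network would end up with different weights, but it would still be piecewise linear with finitely many kinks, so the same thing happens past the last kink. The names `make_relu_net` and `predict` are just illustrative.

```python
def relu(z):
    return max(0.0, z)


def make_relu_net(knots, target):
    """Build a one-hidden-layer ReLU net f(x) = bias + sum(a * relu(w*x + c))
    that linearly interpolates `target` at the given sorted knots."""
    slopes = [(target(b) - target(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    t0 = knots[0]
    # Two units that together give the straight line slopes[0] * (x - t0).
    units = [(slopes[0], 1.0, -t0), (-slopes[0], -1.0, t0)]
    # One unit per interior knot, each adding the change in slope at that knot.
    for t, s_prev, s_next in zip(knots[1:-1], slopes, slopes[1:]):
        units.append((s_next - s_prev, 1.0, -t))
    return target(t0), units


def predict(net, x):
    bias, units = net
    return bias + sum(a * relu(w * x + c) for a, w, c in units)


def square(x):
    return x * x


knots = [-1 + 2 * i / 20 for i in range(21)]  # covers [-1, 1]
net = make_relu_net(knots, square)

# Inside the covered interval the fit is tight.
grid = [-1 + 2 * k / 400 for k in range(401)]
max_err_inside = max(abs(predict(net, x) - square(x)) for x in grid)
print("max error on [-1, 1]:", max_err_inside)

# Outside it, the network just continues along its last linear piece.
for x in (2.0, 3.0, 4.0):
    print(x, "net:", predict(net, x), "true:", square(x))
```

Adding more knots (more hidden units) shrinks the error inside [-1, 1] but does nothing for x = 3. The only way to get the quadratic outside the data is to build it into the model, which is what "stronger priors" means here.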

My sister works in the NLP arm of ML and analogized it to the Clever Hans effect.