| HN Mirror


by talolard 1729 days ago

A quick summary/translation for those of us who don't speak ML.

We keep hearing about these giant models like GPT3 with 1.5 billion parameters. Parameters are the things that change when we train a model; you can think about them as degrees of freedom. If you have a lot of parameters, theory made us believe that the model would just "overfit" the training data, i.e. memorize it. That's bad, because when new data comes in in production we'd expect the model to not be able to "generalize" to it, i.e. make accurate predictions on data it hasn't seen before, because it has just memorized the training data instead of uncovering the "guiding principles" of the data, so to speak.

In practice, these huge models are, in laymans terms, fucking awesome and work really well e.g. they generalize and work in production. No one understands why.

This paper is a survey or overview of what "too many parameters" means, and of all the research into why these models work even though they shouldn't.

8 comments

sdenton4 1729 days ago

My big beef with a lot of the 'leading edge' ML research is that it tends to be waaaaay too focused on classification problems, and ImageNet in particular. And, last I checked, you /do/ still fight with overfitting in classification models, by cleverly choosing learning rate schedules and using early stopping schemes, 'double descent' be damned.

You can solve classification with a hash function: Hash the image, and then just memorize which label goes with which hash. You can try to dodge this obviously dodgy solution by adding augmentation to the dataset. Then you instead learn to find a representation invariant under the set of augmentations, and learn the hash of that representation. It turns out these augmentation-invariant representations are actually pretty good, so we can solve the classification problem in what looks like a general way.

However, there are many other classes of problems where the hash problem doesn't exist, because the information density of the outputs is too high to memorize in the same way. Specifically, generative models, and the sorts of predictive/infill problems used for self-supervision. In these spaces, the problems are more like: "Given this pile of augmented input, generate half a megabyte of coherent output." These kinds of problems simply don't overfit: Train a speech separation model on a big dataset, and the train+eval quality metrics will just asymptote their way up and to the right until you run out of training budget.

IX-103 1729 days ago

Memorization is only an issue if you allow it to be. If you design the model with a "narrow" enough inner stage, then that limits the level of detail (in terms of distinct representable values) passed to subsequent stages. This should give you an ML algorithm that consists of a fingerprint stage (approximating your hashing) followed by a classifier that works on the fingerprint input (approximating a table lookup). Such an algorithm should not have the problem with over-fitting that you describe.

joe_the_user 1728 days ago

> Memorization is only an issue if you allow it to be.

Sure, it's a potential problem that can appear in the process of implementing a deep learning solution. It's not an insurmountable problem. But the fact that it still appears seems like an indication that the situation in deep learning is more complicated than "overparameterization is not a problem".

quocanh 1729 days ago

When you say hash function, do you mean a cryptographic hash function? How on earth could the performance of that be anywhere near the simplest probabilistic algorithm on unseen examples?

sdenton4 1729 days ago

No, nothing cryptographic here. All I'm saying is that you can memorize the dataset by extracting a small fingerprint of each training example and associating it with an output label, i.e., learn by lookup table. Then you don't need to memorize the whole training set, you just need to find/learn the fingerprinting function. With no augmentation, you might as well use MD5... With augmentation, you do need to do some actual work to learn to extract an augmentation-invariant projection of the training examples, but the basic principle is the same.
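
To make the lookup-table idea concrete, here is a minimal sketch of the no-augmentation case. The `LookupClassifier` class, the `fingerprint` helper, and the toy byte strings standing in for images are all made up for illustration; nothing here is from the paper.

```python
import hashlib


def fingerprint(example: bytes) -> str:
    """Fingerprint a raw example; MD5 here, as in the no-augmentation case."""
    return hashlib.md5(example).hexdigest()


class LookupClassifier:
    """Memorizes fingerprint -> label pairs; it does not generalize at all."""

    def __init__(self):
        self.table = {}

    def fit(self, examples, labels):
        for example, label in zip(examples, labels):
            self.table[fingerprint(example)] = label

    def predict(self, example, default="unknown"):
        return self.table.get(fingerprint(example), default)


# Hypothetical toy "images" as raw bytes.
train_x = [b"cat-pixels-001", b"dog-pixels-001", b"cat-pixels-002"]
train_y = ["cat", "dog", "cat"]

clf = LookupClassifier()
clf.fit(train_x, train_y)

# Every training example is recovered exactly.
train_acc = sum(clf.predict(x) == y for x, y in zip(train_x, train_y)) / len(train_x)

# One extra byte (a trivially "augmented" input) already misses the table.
unseen = clf.predict(b"cat-pixels-001 ")
```

Training accuracy is perfect while the slightly perturbed input falls through to the default, which is why augmentation forces the fingerprint to become something smarter than MD5.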

underanalyzer 1729 days ago

I have nothing to do with machine learning, but it seems like the hashing approach would only work if you are “training” on the evaluation set instead of a separate training set. AFAIK, in ImageNet-like challenges the set of labeled training images does not contain any of the evaluation images, so there wouldn’t be any hashes matching any of the evaluation data.

thisiszilff 1729 days ago

Yes, you're right. You should never see the test/evaluation dataset during training, so it would be impossible to "memorize" the test cases. You would get near-perfect accuracy on the training data, but not on the test set. I think the closest analogue would be models that produce conceptual embeddings somewhere in them -- those are kind of like hashes with the property that similar things have similar embeddings. Many classification neural networks kind of operate like that -- the initial layers produce a representation of the data and then the final layer actually performs the classification.

quocanh 1728 days ago

Err.. hash functions like MD5 and SHA256 are "cryptographic". That just means one with a random distribution of outputs as opposed to maybe Apple's "neural" hash function which has outputs that do the "augmentation invariant projection" you speak of.

What I'm trying to say is that neural networks are "universal approximators of continuous real functions". You can think of them as finding the curve of a function which maps the data to an expected output, and they get their predictive power by matching the underlying "function" of the problem.

Applying a cryptographic hash function is like completely scrambling the underlying function. The only way for a neural network to match it is if it was somehow a universal approximator of a discontinuous real function. You can either do that by getting into unexplored chaos theory or by making a gigantic lookup table for every single possible bit combination. The former no human being knows how to do, and the latter is impossible for even a 64-bit combination (never mind an entire image, audio clip, or video).

sdenton4 1728 days ago

>> making a gigantic lookup table for every single possible bit combination

You don't need this to achieve zero loss on the training set, though: You only need a lookup table for the images in the train set.

We know that neural networks can do something like this (learning the lookup table) because large networks can get to zero training loss on randomly assigned labels. (I linked the paper a bit further down in the thread.) This means there's some memorization capability in the architecture, even if it's a weird emulation of some memorization strategy that we would consider easy.

The actual mechanism here is probably closer to random projection + nearest neighbor; NNs are not obviously learning crypto functions. But they /are/ learning some kind of lookup mechanism. There's some indication (see Sara Hooker's work) that in practice they use a mixture of 'reasonable' strategies and memorization for long-tail training examples. We don't know /how much/ the leading networks trained on real labels rely on memorization because we don't have any real insight into the learned structures.
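
As a rough illustration of the "random projection + nearest neighbor" picture (not of how a trained network actually does it), the sketch below memorizes coin-flip labels on pure noise. All sizes and names are assumptions chosen for the example.

```python
import random

rng = random.Random(0)

N_TRAIN, N_INPUTS, N_PROJ = 200, 64, 8

# Unstructured noise inputs with labels assigned by coin flip.
train_x = [[rng.gauss(0.0, 1.0) for _ in range(N_INPUTS)] for _ in range(N_TRAIN)]
train_y = [rng.randint(0, 1) for _ in range(N_TRAIN)]

# Fixed random projection from 64 down to 8 dimensions (never trained).
proj = [[rng.gauss(0.0, 1.0) for _ in range(N_INPUTS)] for _ in range(N_PROJ)]


def project(x):
    return [sum(w * v for w, v in zip(row, x)) for row in proj]


train_z = [project(x) for x in train_x]


def predict(x):
    """Return the label of the nearest training point in projected space."""
    z = project(x)
    dists = [sum((a - b) ** 2 for a, b in zip(z, tz)) for tz in train_z]
    return train_y[dists.index(min(dists))]


train_acc = sum(predict(x) == y for x, y in zip(train_x, train_y)) / N_TRAIN

# Fresh noise with fresh coin-flip labels: there is nothing to generalize to.
test_x = [[rng.gauss(0.0, 1.0) for _ in range(N_INPUTS)] for _ in range(500)]
test_y = [rng.randint(0, 1) for _ in range(500)]
test_acc = sum(predict(x) == y for x, y in zip(test_x, test_y)) / 500
```

Each training point is its own nearest neighbor, so the random labels are fit perfectly, while accuracy on fresh noise stays around chance.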

(as an aside, we train neural networks for discontinuous functions all the time: Classification is discontinuous, by the nature of the labels. We turn it into a continuous+trainable problem by choosing a probabilistic framing.)
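
A small sketch of that probabilistic framing, using made-up logits: the hard label jumps when two logits cross, but the softmax cross-entropy that training actually minimizes changes only slightly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label(logits):
    """Discontinuous output: the index of the largest logit."""
    return logits.index(max(logits))


def cross_entropy(logits, target):
    """Continuous training signal: negative log-probability of the target."""
    return -math.log(softmax(logits)[target])


# Two nearly identical logit vectors on either side of the decision boundary.
before = [1.0, 0.99]
after = [1.0, 1.01]

label_before, label_after = hard_label(before), hard_label(after)
loss_before, loss_after = cross_entropy(before, 0), cross_entropy(after, 0)
```

The predicted class flips from 0 to 1, yet the loss for class 0 moves by only about 0.01, which is what gives gradient descent something smooth to follow.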

quocanh 1727 days ago

Okay but that would only work for examples with which you already have. All interesting cases of neural networks are applying it to unseen inputs. How does your technique work with unseen inputs?

And while we interpret the result of a classification as a 1 or 0, the underlying result is a continuous probability. Even in reality, our training examples are labeled with too much confidence - some labels are vague even for humans. If it approximates a discontinuous function, then it does so by approximating a continuous function. You can read here for more information: https://www.sciencedirect.com/science/article/abs/pii/089360...

r-zip 1729 days ago

Has this been implemented? What kinds of hashing functions are you talking about? How would you guarantee the same hash for all the augmentations?

It seems like the approach you describe just moves the complexity of the task solved by neural networks into the hashing function.

sdenton4 1729 days ago

https://openreview.net/forum?id=Sy8gdB9xx

"our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise."

r-zip 1722 days ago

What's your point? It's not clear at all from this post.

zibzab 1728 days ago

Fuzzy hashing is a thing and may actually use cryptographic hashing internally.

See for example recent work on hashing for malware detection.

baron_harkonnen 1729 days ago

> In practice, these huge models are, in laymans terms, fucking awesome and work really well

A similarly surprising result from an adjacent community, Bayesian Statistics, is that in the case of hierarchical models, increasing your number of parameters can paradoxically reduce overfitting.

The scale of parameters in Bayesian models is nowhere near that of these deep neural nets, but nonetheless this is a similarly shocking result, since adding parameters is typically penalized when model building.

It's a bit more explainable in Bayesian stats since what you're using some parameters for is limiting the impact of more granular parameters (i.e. you're learning a prior probability distribution for the other parameters, which prevents extreme overfitting in cases with less information).
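
A minimal sketch of that shrinkage effect, assuming the simplest normal-normal setup. In a real hierarchical model the prior's `mu` and `tau` would themselves be inferred from all groups; here they are fixed by hand, and every number is invented for illustration.

```python
def partial_pool(group_mean, n_obs, mu, tau, sigma):
    """Posterior mean of a group effect under a normal-normal model.

    group_mean: raw average of the group's observations
    n_obs:      number of observations in the group
    mu, tau:    mean and std of the shared prior over group effects
    sigma:      std of the observation noise within a group
    """
    data_precision = n_obs / sigma**2
    prior_precision = 1.0 / tau**2
    return (data_precision * group_mean + prior_precision * mu) / (
        data_precision + prior_precision
    )


# Assumed values for illustration only.
MU, TAU, SIGMA = 0.0, 1.0, 2.0

# Two groups with the same extreme raw average but different amounts of data.
small_group = partial_pool(group_mean=3.0, n_obs=2, mu=MU, tau=TAU, sigma=SIGMA)
large_group = partial_pool(group_mean=3.0, n_obs=200, mu=MU, tau=TAU, sigma=SIGMA)
```

With these numbers the two-observation group is pulled from 3.0 down to 1.0, while the 200-observation group only moves to about 2.94: the extra prior parameters restrain exactly the estimates that have the least data behind them.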

I wouldn't be too surprised if eventually we realized there was a similar cause preventing overfitting in overparameterized ML models.

naomisperfume 1729 days ago

Do you have any good references of this phenomenon in hierarchical models?

bigfudge 1728 days ago

Gelman and Hill has a brief intro and some references.

sjg007 1729 days ago

This is likely part of the reason. The only problem is that said models require a lot of data, but humans can learn from a very small number of examples.

sdenton4 1728 days ago

Humans are continuously pretrained on a variety of tasks, though. Teaching a kid to say one word takes about a year...

gota 1728 days ago

And the very same system that is being trained to say a word is also being trained to recognize intent from intonation all the while. As soon as it can say it, the child will likely use that one word with different tones to mean different things successfully.

We are insanely complex machines...

oleg_myrk 1728 days ago

Any chance you could share a link to a relevant paper?

talolard 1729 days ago

I don’t know if it’s correct, but I often think of a classification model as learning the parameters of a Dirichlet distribution, with the final softmax layer being a sample from it.

yldedly 1729 days ago

>In practice, these huge models are, in laymans terms, fucking awesome and work really well e.g. they generalize and work in production. No one understands why.

To add nuance to this, these models are awesome at interpolation, but not so much at extrapolation. Or in different terms, they generalize very well to an IID test set, but don't generalize under (even slight) distribution shift.

The main reason for this is that these models tend to solve classification and regression problems quite differently from how humans do it. Broadly speaking, a large, flexible NN will find a "shortcut", i.e. a simple relation between some part of the input and the output, which may not be informative in the way we want; such as a watermark in the corner of an image, or statistical regularities in textures which disappear in slightly different lighting conditions. See e.g. https://thegradient.pub/shortcuts-neural-networks-love-to-ch...

I think it's fair to say that these models are great when you have an enormous dataset that covers the entire domain, but sub-Google-scale problems are usually still solved by underparametrized models (even at Google).

rich_sasha 1729 days ago

It depends. It really doesn’t take that much data to train a pretty stunning (if simple) RNN character-level “language model” that beats any n-gram. The same goes for MNIST. ANNs really are a useful tool for a vast class of problems, many of which can be solved with comparatively little data.

Maybe your point stands, and it’s just that some domains need less data, just saying.

yldedly 1729 days ago

>ANNs really are a useful tool for a vast class of problems, many of which can be solved with comparatively little data.

For sure, it all depends on how robust the model needs to be, how strongly it needs to generalize. If your dataset covers the entire domain, you don't need a robust model. If you need strong generalization, then you need to build in stronger priors.

Take f(x) = x^2. If your model only needs to work in a finite interval, you just need a decent sample that covers that interval. But if it needs to generalize outside that interval, no amount of parameters will give you good performance. Outside the boundaries of the interval, the NN will either be constant (with a sigmoid activation) or linear (with ReLU-type activations).
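
The ReLU half of that claim can be checked directly. The sketch below uses a small one-hidden-layer network with arbitrary, untrained weights (the argument does not depend on training) and measures its curvature far from the origin; the sizes and weight ranges are assumptions.

```python
import random

rng = random.Random(0)
HIDDEN = 16

# A one-hidden-layer ReLU network on a scalar input, with arbitrary weights.
# Input weights are kept away from zero so every unit has a kink near 0.
w1 = [rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 1.5) for _ in range(HIDDEN)]
b1 = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]
w2 = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]


def relu_net(x):
    return sum(v * max(w * x + b, 0.0) for w, b, v in zip(w1, b1, w2))


# Each hidden unit switches on or off at x = -b / w. Past the last such kink
# the set of active units is frozen, so the network is a single straight line.
last_kink = max(abs(b / w) for w, b in zip(w1, b1))


def second_difference(f, x, h=1.0):
    """Discrete curvature: zero for a straight line (up to rounding)."""
    return f(x + h) - 2.0 * f(x) + f(x - h)


x_far = last_kink + 10.0
net_curvature = second_difference(relu_net, x_far)
target_curvature = second_difference(lambda x: x * x, x_far)
```

Beyond the last kink the network's curvature is zero to within rounding error, while x^2 keeps a constant curvature of 2, so the gap between them grows without bound however many hidden units are added.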

jcranberry 1728 days ago

My sister works in the NLP arm of ML and analogized it to the Clever Hans effect.

Salgat 1729 days ago

To add to this, there's a misleading phenomenon that first occurs where the performance actually gets worse with too much data/parameters/epochs, but oddly improves again if you throw even more at the model.

jointpdf 1729 days ago

For the interested, this phenomenon is known as (deep) double descent:

https://openai.com/blog/deep-double-descent/

https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understand...

(Edit: Oh, the definition appears in the abstract of the linked paper.)

nexuist 1729 days ago

Is this the ML equivalent of the Dunning–Kruger effect? A model with a bit of data is too afraid of being wrong to be overconfident. A model with a bit more data is overconfident in itself and gets things wrong. Finally, a model with tons and tons of data understands the complexity of the problem set and once again becomes too afraid of being wrong.

visarga 1729 days ago

Model confidence as reported by softmax probability scores is notoriously noisy and miscalibrated. With larger models and more data the confidence estimation gets more nuanced.

988747 1729 days ago

> In practice, these huge models are, in laymans terms, fucking awesome and work really well e.g. they generalize and work in production. No one understands why.

How about the resulting weights? If most of them are close to 0, then that would mean that a part of the training is for NN to learn which of 1.5B parameters are relevant, and which are not.

rich_sasha 1729 days ago

There is something called the lottery ticket hypothesis (maybe mentioned in the paper, I’m on my phone), that says indeed that the large models are effectively ensembles of massive random models, and the top levels of the network pick the one or two that randomly happen to work.

Maybe true, but even then it's only part of the story: kernels in CNNs genuinely seem to learn features like edges and textures.

talolard 1729 days ago

There are two answers to this. First, empirically we see that the more parameters we add, the better the model performs ==> weights continue to contribute (and aren't dead).

Second, there is a very popular paper called "The lottery ticket hypothesis" [1], which argues that in any network you can find subnetworks that work just as well, i.e. the parameters are redundant. This was written in 2018, which is a long time ago in big NN world, so I'm not sure how it holds up to current insanity-sized models.

[1] https://arxiv.org/abs/1803.03635

sdenton4 1728 days ago

A couple notes...

1) Imagine the loss surface of a given model architecture; each point on the surface corresponds to a full set of weights, and the value at the point is the model loss. So, a billion-dimensional surface, give or take. There's a massive amount of flexibility in that space. Some models in the surface are sparse, but they are adjacent to models which are just as good but not sparse at all. Likewise, if you 'rotate' a sparse model, you can end up with an entirely equivalent dense model. So, you really need additional 'pressure' on the learning problem to ensure you actually get sparsity, even if the sparsity is in some sense natural.
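
The "rotation" point can be sketched with a tiny two-layer linear model; the matrices and the angle are made up. Note the assumption: this exact equivalence holds for purely linear layers, and an element-wise nonlinearity between them would generally not commute with an arbitrary rotation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def count_zeros(a):
    return sum(1 for row in a for v in row if v == 0.0)


# A sparse two-layer *linear* model: y = W2 @ (W1 @ x). Values are made up.
w1 = [[2.0, 0.0],
      [0.0, -1.0]]
w2 = [[0.0, 3.0],
      [1.0, 0.0]]

# Rotate the 2-d hidden space by an arbitrary angle.
theta = 0.7
rot = [[math.cos(theta), -math.sin(theta)],
       [math.sin(theta), math.cos(theta)]]

w1_dense = matmul(rot, w1)             # R @ W1
w2_dense = matmul(w2, transpose(rot))  # W2 @ R^T, which undoes R

# Both weight sets compute the same end-to-end map.
original = matmul(w2, w1)
rotated = matmul(w2_dense, w1_dense)
max_diff = max(abs(a - b) for ra, rb in zip(original, rotated) for a, b in zip(ra, rb))
```

The original weights contain four zeros and the rotated ones none, yet the two models agree to within rounding error, so the loss alone cannot prefer the sparse one.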

2) IIUC, lottery ticket kinda breaks with larger models/problems. For small enough problems, the initial random projection given by the random starting weights is already good enough to build on. For bigger + more complicated problems, you need to really adapt in early training, and so lottery ticket breaks down.

diffCtx 1729 days ago

My take as a 90s math grad (out of touch with modern teaching): Theory is useful to show human society is stagnant.

There’s an infinite number of sentences, but our ML models are having tons of “success” because society relies on a finite set in daily life: those that instigate commerce.

Like religion relied on an acceptable finite set of sentences, so too does our society. We’re a bunch of weird little missionaries living in one geometric world, still believing in bigger purpose.

ML isn’t really outputting novelty, it’s spewing our own inanity at us, and helping correct some bad math in engineering cases.

We’re easily mesmerized apes.

andrewnc 1729 days ago

Super pedantic comment. GPT-3 has 175 Billion parameters. GPT-2 was the 1.5 Billion model.

aomobile 1729 days ago

Thanks for the summary!