by davidguetta 1043 days ago
hierarchize would be a better term than generalize

Anything would be better than "grokking".

From what I gather they're talking about double descent which afaik is the consequence of overparameterization leading to a smooth interpolation between the training data as opposed to what happens in traditional overfitting. Imagine a polynomial fit with the same degree as the number of data points (swinging up and down wildly away from the data) compared with a much higher degree fit that could smoothly interpolate between the points while still landing right on them.

None of this is what I would call generalization, it's good interpolation, which is what deep learning does in a very high dimensional space. It's notoriously awful at extrapolating, ie generalizing to anything without support in the training data.
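A rough sketch of the polynomial picture described above, for anyone who wants to poke at it. Everything here is an assumption for illustration: the toy target `sin(pi x)`, the 10 sample points, the Legendre basis, and the choice of the minimum-norm interpolant (which is what `np.linalg.lstsq` returns when there are more coefficients than points). Degree 9 is the lowest degree that passes exactly through 10 points; degree 200 stands in for the "much higher degree" fit. Whether the high-degree curve actually comes out smoother depends heavily on the basis and on which interpolant gets picked, so the script just prints the errors instead of promising a result.

```python
import numpy as np
from numpy.polynomial import legendre

# Hypothetical toy data: 10 evenly spaced samples of a smooth target on [-1, 1].
x_train = np.linspace(-1.0, 1.0, 10)
y_train = np.sin(np.pi * x_train)
x_dense = np.linspace(-1.0, 1.0, 401)
y_dense = np.sin(np.pi * x_dense)

def fit_and_predict(degree):
    """Fit Legendre-basis coefficients by least squares and evaluate the fit."""
    A = legendre.legvander(x_train, degree)  # shape (10, degree + 1)
    # When degree + 1 > 10 the system is underdetermined; lstsq then returns
    # the minimum-norm coefficient vector among all exact interpolants.
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    train_pred = A @ coef
    dense_pred = legendre.legvander(x_dense, degree) @ coef
    return train_pred, dense_pred

results = {}
for degree in (9, 200):
    train_pred, dense_pred = fit_and_predict(degree)
    results[degree] = {
        # Error at the training points (both fits should land on them).
        "max_train_error": float(np.max(np.abs(train_pred - y_train))),
        # Error between the training points (this is where the fits differ).
        "max_dense_error": float(np.max(np.abs(dense_pred - y_dense))),
    }
    print(degree, results[degree])
```

Both fits hit the training points; the interesting number is `max_dense_error`, which measures what happens in between them. Swapping the basis or the target is a quick way to see how fragile the "smooth interpolation" story is.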

double descent is a different phenomenon from grokking
Nope, they are the same, just that grokking is when the KL divergence between the representable information of the implicit biases and the data is extremely high (i.e. the network is poorly designed or oriented for the task).

It's an informal term that not everyone accepts. Double-descent is acceptable as it describes a general phenomenon that is a natural consequence of a phase transition during neural network training. Grokking is like, to me, the 'fetch' of neural network terms. It's not new, it adds a seeming layer of separation from double-descent (which it is, just very delayed), and it's not really accepted by everyone.

I personally do not like it at all. Especially because language affects _our_ implicit biases about what neural networks can and cannot do. We've already seen that their capacities and performance can be pushed way beyond what we traditionally expect of them.

But to summarize, they are the same. And this is why we need good terminology, as well, because poor adoption and boosting of improper terminology induces excess regret in the information exchange surface between agents in a game-theoretic sense in this lovely landscape of the ML world.

> It's notoriously awful at extrapolating, ie generalizing to anything without support in the training data.

Scientists are also pretty lousy at making new discoveries without labs. They just need training data.

Generalizing is seeing common principles and patterns across disparate instances of a phenomenon. It's a proper word for this.
That's a common mechanism to achieve generalization, but the term is a little more general (heh) than that. It specifically refers to correctly handling data that lives outside the distribution presented by the training data.

It's a description of a behavior, not a mechanism. Which may or may not be appropriate depending on whether you are talking about *what* the model does or *how* it achieves it.

Kinda fuzzy what's "in the distribution", because it depends on how deeply the model interprets it. If it understands examples outside the distribution... that kinda puts them in the distribution.

General understanding makes the information in the distribution very wide. Shallow understanding makes it very narrow. Like say recognizing only specific combinations of pixels verbatim.

I think you are misinterpreting. The distribution present in the training set in isolation (the one I'm referring to, which is not fuzzy in the slightest) is not the same thing as the distribution understood by the trained model (the one you are referring to, which is definitely more conceptual and hard to characterize in non-trivial cases).

"Generalization" is simply the theoretical measure of how much the latter extends beyond the former, regardless of how that's achieved.

I'm saying how you determine the distribution in the training set depends on what the model understands and what the people who selected the dataset understand.

There's no distribution of meaning in the training set that's independent of interpretation and understanding. Aside from maybe the literal series of bits (and words and pixels) in it, as encoded.

In statistics that is not as severe a problem because you can plot how the data distribution lies in a commonly agreed upon position in one or more clearly defined and agreed upon dimensions. And you can look at the chart and talk about this shared interpretation objectively, and its distribution.

Although as a matter of fact, just as often the distribution of answers you got depends on what questions you asked, and how and when and whom you asked. Lying with statistics is easy because it's full of hidden variables. This is why statistics is great when the data is simple and the analysis is simple, mathematical, objective, but social studies tend to yield whatever you want them to yield.

So. What dimensions are we talking about with a self-evolved model? You have some understanding of what the data is, subjective to you. Maybe your team has some shared understanding of what the data covers, you have overlap. But the model has its own understanding, evolved independently. How much does it overlap with you? Not as much as you think.

It's a decades-old problem: people give the model data that contains things they didn't realize it contains. They themselves didn't see them. And then they get surprised by the results.

Say when an apple falls on your head, did you realize this contains the data required to describe classical mechanics? For centuries, billions of people didn't realize. To Newton it was there as clear as daylight. In the apple's fall. I know, the example is a myth, but the principle stands.

Another example: a video of the changing light patterns reflected on the floor around the corner of a room where a person, out of frame, is typing on a computer. What does this data contain? You think nothing much. Maybe it contains how a floor looks. To a model, it can easily also contain what the person who is not in frame typed on their keyboard.

So given all this... what IS in the distribution? Depends on whose eyes you're looking through. Your eyes are not the most objective eyes, nor the most intelligent eyes. You have no anchor to point to as the ultimate arbiter of what complex data contains or does not.

Generalize has a tendency to imply you can extrapolate. And in most cases it's actually the opposite that happens: neural nets tend to COMPRESS the data (which in turn is a good thing in many cases, because the data is noisy).
The point of compression is to decompress after. That's what happens during inference, and that's when the extrapolation occurs.

Let's say I tell GPT "write 8 times foobar". Will it? Well then it understands me and can extrapolate from the request to the proper response, without having specifically "write 8 times foobar" in its model.

Most decompression algorithms focus on predicting the next token (byte, term, etc.), believe it or not. The more accurately they predict the next token, the less information you need to store to correct mispredictions.
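To make the prediction/compression link concrete: an ideal entropy coder spends about `-log2(p)` bits on a symbol the model predicted with probability `p`, so better predictions mean fewer stored bits. The sketch below is a toy under stated assumptions, not how any particular compressor works: `uniform_bits` and `adaptive_bits` are hypothetical helper names, the "model" is a simple adaptive byte-frequency counter with add-one smoothing, and it only totals ideal code lengths without producing an actual bitstream.

```python
import math
from collections import Counter

def uniform_bits(data: bytes) -> float:
    """Ideal code length when every byte value is predicted as equally likely."""
    return sum(-math.log2(1 / 256) for _ in data)

def adaptive_bits(data: bytes) -> float:
    """Ideal code length under an adaptive order-0 model with add-one smoothing.

    The model only uses bytes it has already seen, so a decoder running the
    same counter could reproduce every prediction.
    """
    counts = Counter()
    total_bits = 0.0
    for seen, byte in enumerate(data):
        # Predicted probability of this byte given the bytes before it.
        p = (counts[byte] + 1) / (seen + 256)
        total_bits += -math.log2(p)
        counts[byte] += 1
    return total_bits

message = b"foobar" * 8
print("uniform predictor :", uniform_bits(message), "bits")
print("adaptive predictor:", adaptive_bits(message), "bits")
```

The uniform predictor pays 8 bits for every byte no matter what; the adaptive one pays less as it learns that only a handful of byte values ever show up. A stronger predictor (longer context, or a language model) would push the total down further.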

"hierarchize" only describes your own mental model of how knowledge organization and reasoning may work in the model, not the actual phenomenon being observed here.

"generalize" means going from specific examples to general cases not seen before, which is a perfectly good description of the phenomenon. Why try to invent a new word?

> "hierarchize" only describes your own mental model of how knowledge organization and reasoning may work in the model, not the actual phenomenon being observed here

That's not true. If you look at a deep CNN, the lower layers show lines and the higher layers show complex stuff like eyes or football players, etc. Hierarchization of information actually emerges naturally in NNs.

Generalization often implies extrapolation to new data, which is just not the case most of the time with NNs, and that's why I didn't like the word.