by RC_ITR 1229 days ago

>Not so. Actually, (for example) the phenomenon of "grokking" is when with enough training a NN eventually experiences a phase-change from memorising data to learning the general rules underlying it.

Reading the paper, what they seem to be getting at is "when the dataset is algorithmic (like multiplication tables), the parameters get set in a way that appears to replicate the algorithm."

That's cool, but not what GPT is.

>I feel that people seem to have forgotten that deep learning is so powerful because it performs feature/representation learning, not because it can memorise, although that's powerful too. IMO that is the proper definition of 'deep learning'.

That's not what GPT is doing.

1 comments

versteegen 1228 days ago

Grokking doesn't just happen for algorithmic data; it also happens, less dramatically, in other datasets [3]. Grokking seems to be closely related to double descent [4], which is quite widespread. Anyway, I only wanted to give grokking as an example of how memorisation doesn't preclude generalisation; it may simply precede it.

> That's not what GPT is doing.

I don't follow. Of course GPT models are learning representations (but I doubt you meant to deny this); that's how they can do semantic matching against their knowledge base (memorised information) in order to generalise from it. They don't only spit out training data verbatim.

Anyway, I didn't claim any GPT variant has actually "learn[t] math", but that it's not impossible with unlimited training.

[3] Liu &al. Omnigrok: Grokking Beyond Algorithmic Data https://openreview.net/forum?id=zDiHoIWa0q1

[4] Davies &al. Unifying Grokking and Double Descent https://openreview.net/pdf?id=JqtHMZtqWm

RC_ITR 1228 days ago

Again, reading these papers, grokking can happen only in very limited circumstances for non-algorithmic datasets.

> They verify this observation in a student teacher setup, and show that it can arise in non-algorithmic datasets if initialized in a certain weight regime for appropriate sample size.

It’s not a widespread phenomenon by any means and it is not observably happening inside GPT. No amount of training will change that, only a drastic specialization of the training data (which defeats the purpose).

> They don't only spit out training data verbatim.

I’m not saying verbatim. But I am saying it won’t return a pattern it hasn’t seen in its dataset before. The whole point of attention is that the token isn’t just the word, but the word as it exists in context. If you expand "verbatim" to include that as the token, then yes, that is exactly what GPT does: it will not connect two tokens unless it was trained on data that implies those tokens should be connected, and it knows nothing else about what those tokens are.

Again, to put it simply, a 3rd grader can multiply any two numbers (and I mean literally from the infinite set). GPT cannot, and never will be able to, multiply numbers from an infinite set.
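For concreteness, here is a minimal sketch of the grade-school long multiplication procedure being referred to, written in Python. It works on digit strings of any length, which is the sense in which the procedure covers an unbounded set of inputs. The function name `schoolbook_multiply` and the step counter are illustrative choices, not taken from any paper discussed in this thread.

```python
from typing import Tuple


def schoolbook_multiply(a: str, b: str) -> Tuple[str, int]:
    """Multiply two non-negative decimal digit strings, one digit pair at a time.

    Returns the product as a digit string together with the number of
    single-digit multiplications that were performed.
    """
    if not a or not b or any(c not in "0123456789" for c in a + b):
        raise ValueError("expected non-empty strings of decimal digits")

    # result[k] holds the digit for 10**k (least significant digit first).
    result = [0] * (len(a) + len(b))
    steps = 0

    for i, da in enumerate(reversed(a)):
        carry = 0
        for j, db in enumerate(reversed(b)):
            total = result[i + j] + int(da) * int(db) + carry
            result[i + j] = total % 10
            carry = total // 10
            steps += 1
        # The slot just past this row has not been written yet, so the
        # final carry of the row (at most 9) fits into it directly.
        result[i + len(b)] += carry

    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return (digits or "0", steps)


if __name__ == "__main__":
    product, steps = schoolbook_multiply("1234", "5678")
    print(product, steps)
```

The inner loop runs once per pair of digits, so the step count is the product of the two input lengths. The procedure itself has no built-in limit on length; only the time it takes grows.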

versteegen 1228 days ago

I wrote that double descent is widespread, not grokking.

Of course a transformer can't do multiplication or any other kind of operation on an infinite set of numbers, because it has only bounded depth, which limits the number of steps of any algorithm it can emulate. But I think I see how I could build a transformer by hand that could multiply any two 4-digit numbers. The difficulty is the quadratic number of steps. Addition and subtraction are far easier, and [1] shows that they can be learned: "By introducing position tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation". But they needed to change the input representation; otherwise, finding the n-th digit would require scanning the number from the right end while counting, which seems to be difficult to learn.
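To make the position-token idea concrete, here is a small Python sketch that rewrites an integer so that each digit is followed by an explicit power-of-ten marker, matching the quoted example "3 10e1 2" for 32. The function names are hypothetical, and leaving out a marker for the ones digit is an assumption made to match that single quoted example. The exact format used in [1] may differ.

```python
def to_position_tokens(n: int) -> str:
    """Render a non-negative integer as digits tagged with their power of ten.

    Assumption: the ones digit carries no marker, as in the quoted "3 10e1 2".
    """
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    digits = str(n)
    tokens = []
    for i, d in enumerate(digits):
        power = len(digits) - 1 - i
        tokens.append(d)
        if power > 0:
            tokens.append(f"10e{power}")
    return " ".join(tokens)


def from_position_tokens(text: str) -> int:
    """Invert to_position_tokens by dropping the markers and joining the digits."""
    digits = [t for t in text.split() if not t.startswith("10e")]
    return int("".join(digits))


if __name__ == "__main__":
    print(to_position_tokens(32))
    print(from_position_tokens("3 10e1 2"))
```

With this encoding, the place value of every digit is stated next to it, so nothing has to count positions from the right end of the number to know which digits line up.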

But we are in partial agreement. I don't actually think transformers are great, I think they're awfully limited, but the fact that mere pattern-matching can achieve so much makes me highly optimistic about better methods, e.g. adding working memory.

[1] Investigating the Limitations of Transformers with Simple Arithmetic Tasks https://arxiv.org/abs/2102.13019
