by versteegen 1227 days ago
> It will never learn math this way, no matter how much training you give it.

Not so. Actually, (for example) the phenomenon of "grokking" is when with enough training a NN eventually experiences a phase-change from memorising data to learning the general rules underlying it [1].

Grokking isn't actually desirable; it's better for the model to go more directly and quickly to learning the general rule, which is achievable in toy problems (called "comprehension" in [2]).

I feel that people seem to have forgotten that deep learning is so powerful because it performs feature/representation learning, not because it can memorise, although that's powerful too. IMO that is the proper definition of 'deep learning'.

[1] Power &al. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets https://arxiv.org/abs/2201.02177

[2] Liu &al. Towards Understanding Grokking: An Effective Theory of Representation Learning https://arxiv.org/abs/2205.10343


An NN can certainly assimilate a simple algorithm, and will even be able to do so for bigger and more complex algorithms. But I think it's mostly impractical at the current level of technology, especially in terms of speed, size, and energy efficiency.

It kinda reminds me of Deep Blue. In principle, a simple DFS over the game tree has always been able to beat humans at chess, but it was only in the 1990s that a computer finally beat the reigning world champion. The reason? A dumb DFS is impractically slow, and the human player will die of old age before the computer can finish its calculation.
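To make the "correct in principle, too slow in practice" point concrete, here is a minimal sketch of an exhaustive DFS on a toy game instead of chess. The game (players alternately take 1 to 3 stones, whoever takes the last stone wins), the `solve` function, and the node counter are all illustrative assumptions, not anything Deep Blue did.

```python
def solve(stones, counter):
    """Exhaustive DFS over the game tree, with no pruning and no caching.

    Returns True if the player to move can force a win.
    counter is a one-element list used to count visited positions.
    """
    counter[0] += 1
    if stones == 0:
        # No stones left: the previous player took the last one and won.
        return False
    # Try every legal move; we win if some move leaves the opponent losing.
    outcomes = [not solve(stones - take, counter)
                for take in (1, 2, 3) if take <= stones]
    return any(outcomes)


if __name__ == "__main__":
    for n in (4, 8, 12, 16, 20):
        visited = [0]
        wins = solve(n, visited)
        print(f"{n} stones: first player wins={wins}, positions visited={visited[0]}")
```

The answers are always right, but the number of visited positions grows exponentially with the starting pile, which is the same wall a naive search hits in chess, only far sooner there.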

I believe the same goes for the current AI trend. What we have right now is rather crude. The approach itself has lots of potential, but the actual solution is yet to be found. It's really sad that people keep hyping up these partial solutions as zee AI. Whatever.

>Not so. Actually, (for example) the phenomenon of "grokking" is when with enough training a NN eventually experiences a phase-change from memorising data to learning the general rules underlying it.

Reading the paper, what they seem to be getting at is "when the dataset is algorithmic (like multiplication tables), the parameters get set in a way that appears to replicate the algorithm."

That's cool, but not what GPT is.

>I feel that people seem to have forgotten that deep learning is so powerful because it performs feature/representation learning, not because it can memorise, although that's powerful too. IMO that is the proper definition of 'deep learning'.

That's not what GPT is doing.

Grokking doesn't just happen for algorithmic data; it also happens, less dramatically, in other datasets [3]. Grokking seems to be closely related to double descent [4], which is quite widespread. Anyway, I only wanted to give grokking as an example of how memorisation doesn't preclude generalisation; it may simply precede it.

> That's not what GPT is doing.

I don't follow. Of course GPT models are learning representations (but I doubt you meant to deny this); that's how they can do semantic matching against their knowledge base (memorised information) in order to generalise from it. They don't only spit out training data verbatim.

Anyway, I didn't claim any GPT variant has actually "learn[t] math", but that it's not impossible with unlimited training.

[3] Liu &al. Omnigrok: Grokking Beyond Algorithmic Data https://openreview.net/forum?id=zDiHoIWa0q1

[4] Davies &al. Unifying Grokking and Double Descent https://openreview.net/pdf?id=JqtHMZtqWm

Again, reading these papers, it looks like grokking can happen only in very limited circumstances for non-algorithmic datasets.

> They verify this observation in a student teacher setup, and show that it can arise in non-algorithmic datasets if initialized in a certain weight regime for appropriate sample size.

It’s not a widespread phenomenon by any means and it is not observably happening inside GPT. No amount of training will change that, only a drastic specialization of the training data (which defeats the purpose).

> They don't only spit out training data verbatim.

I’m not saying verbatim. But I am saying it won’t return a pattern it hasn’t seen in its dataset before. The whole point of attention is that the token isn’t just the word, but the word as it exists in context. If you expand "verbatim" to include that as the token, then yes, that is exactly what GPT does: it will not connect two tokens unless it was trained on data that implies those tokens should be connected, and it knows nothing else about what those tokens are.
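For readers unfamiliar with the "word as it exists in context" idea, here is a rough sketch of single-head scaled dot-product self-attention. It assumes identity query/key/value projections and made-up 2-dimensional embeddings (`EMB` is hypothetical); a real GPT uses learned projections, many heads, and many layers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Each output vector is a weighted average of all input vectors, so it
    depends on the whole sequence, not only on the token at that position.
    """
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(d)])
    return outputs


# Made-up embeddings, purely for illustration.
EMB = {"river": [1.0, 0.0], "bank": [0.5, 0.5], "money": [0.0, 1.0]}

# Representation of "bank" after attention, in two different contexts.
ctx_a = self_attention([EMB["river"], EMB["bank"]])[1]
ctx_b = self_attention([EMB["money"], EMB["bank"]])[1]

if __name__ == "__main__":
    print("bank after 'river':", ctx_a)
    print("bank after 'money':", ctx_b)
```

The same input token ends up with a different vector in each context, which is the sense in which the effective "token" is the word plus its surroundings.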

Again, to put it simply, a 3rd grader can multiply any two numbers (and I mean literally the infinite set). GPT cannot and never will be able to multiply an infinite set of numbers.
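The procedure a 3rd grader uses is presumably schoolbook long multiplication. As a sketch (the function name and the step counter are illustrative assumptions), here it is on decimal strings of any length:

```python
def schoolbook_multiply(a: str, b: str):
    """Multiply two non-negative decimal strings digit by digit.

    Returns (product_as_string, number_of_single_digit_multiplications).
    """
    x = [int(c) for c in reversed(a)]  # least significant digit first
    y = [int(c) for c in reversed(b)]
    result = [0] * (len(x) + len(y))
    steps = 0
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10
            carry = total // 10
            steps += 1
        result[i + len(y)] += carry  # leftover carry of this row
    text = "".join(str(d) for d in reversed(result)).lstrip("0")
    return (text or "0"), steps


if __name__ == "__main__":
    print(schoolbook_multiply("1234", "5678"))  # ('7006652', 16)
```

One fixed rule covers inputs of every length, but the number of single-digit products is the product of the two lengths, so the work grows quadratically with the number of digits.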

I wrote that double descent is widespread, not grokking.

Of course a transformer can't do multiplication or any other kind of operation on an infinite set of numbers, because it has only bounded depth, which limits the number of steps of any algorithm it can emulate. But I think I see how I could build a transformer by hand that could multiply any two 4-digit numbers. The difficulty is the quadratic number of steps. Addition and subtraction are far easier; [1] shows that they can be solved: "By introducing position tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation". But they needed to change the input representation; otherwise finding the n-th digit would require scanning the number from the right end while counting, which seems to be difficult to learn.
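To show what such a surface representation might look like, here is a small sketch that rewrites an integer with explicit position tokens. The exact format is an assumption modelled on the quoted example "3 10e1 2" (every digit except the units digit is followed by its power of ten); the paper's actual tokenisation may differ, and `to_position_tokens` is a hypothetical name.

```python
def to_position_tokens(n: int) -> str:
    """Write a non-negative integer with a place marker after each digit.

    Assumed format: each digit except the units digit is followed by
    "10e<k>", where k is that digit's power of ten.
    """
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    digits = str(n)
    tokens = []
    for i, d in enumerate(digits):
        power = len(digits) - 1 - i
        tokens.append(d)
        if power > 0:
            tokens.append(f"10e{power}")
    return " ".join(tokens)


if __name__ == "__main__":
    print(to_position_tokens(32))    # 3 10e1 2
    print(to_position_tokens(4071))  # 4 10e3 0 10e2 7 10e1 1
```

With the place value written next to each digit, the model no longer has to count from the right end to work out which digits line up.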

But we are in partial agreement. I don't actually think transformers are great, I think they're awfully limited, but the fact that mere pattern-matching can achieve so much makes me highly optimistic about better methods, e.g. adding working memory.

[1] Nogueira &al. Investigating the Limitations of Transformers with Simple Arithmetic Tasks https://arxiv.org/abs/2102.13019