by RC_ITR
1233 days ago
Again, reading these papers, grokking can happen only in very limited circumstances for non-algorithmic datasets.

> They verify this observation in a student teacher setup, and show that it can arise in non-algorithmic datasets if initialized in a certain weight regime for appropriate sample size.

It's not a widespread phenomenon by any means, and it is not observably happening inside GPT. No amount of training will change that, only a drastic specialization of the training data (which defeats the purpose).

> They don't only spit out training data verbatim.

I'm not saying verbatim. But I am saying it won't return a pattern it hasn't seen in its dataset before. The whole point of attention is that the token isn't just the word, but the word as it exists in context. If you expand "verbatim" to include that as the token, then yes, that is exactly what GPT does: it will not connect two tokens unless it was trained on data that implies those tokens should be connected, and it knows nothing else about what those tokens are.

Again, to put it simply, a 3rd grader can multiply any two numbers (and I mean literally the infinite set). GPT cannot and never will be able to multiply an infinite set of numbers.
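For concreteness, the third-grade procedure is presumably schoolbook long multiplication. Here is a rough Python sketch of it on digit strings, so the same few rules apply to inputs of any length. The function name `schoolbook_multiply` and the step counter are illustrative assumptions, not something from the papers under discussion.

```python
from typing import Tuple


def schoolbook_multiply(a: str, b: str) -> Tuple[str, int]:
    """Multiply two non-negative decimal strings digit by digit.

    Returns the product as a string plus the number of single-digit
    multiplications performed (len(a) * len(b) for this method).
    """
    x = [int(ch) for ch in reversed(a)]  # least significant digit first
    y = [int(ch) for ch in reversed(b)]
    result = [0] * (len(x) + len(y))
    steps = 0
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10   # digit that stays in this column
            carry = total // 10          # digit carried to the next column
            steps += 1
        result[i + len(y)] += carry      # leftover carry of this row
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0", steps


product, steps = schoolbook_multiply("1234", "5678")
print(product, steps)
```

The rules never change with input size, but the number of single-digit steps grows with the product of the two lengths.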
|
Of course a transformer can't do multiplication or any other kind of operation on an infinite set of numbers, because it has only bounded depth, which limits the number of steps of any algorithm it can emulate. But I think I see how I could build a transformer by hand that could multiply any two 4-digit numbers. The difficulty is the quadratic number of steps. Addition and subtraction are far easier; [1] shows that they can be learned: "By introducing position tokens (e.g., "3 10e1 2"), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation". But they needed to change the input representation; otherwise finding the n-th digit would require scanning the number from the right end while counting, which seems to be difficult to learn.
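To make the position-token idea concrete, here is a minimal sketch of how a number could be rewritten so that each digit carries its own place value. The function name `to_position_tokens` and the `mark_units` flag are my own assumptions; the only format taken from [1] is the quoted example "3 10e1 2", and whether the units digit also gets a `10e0` token is left as an option here.

```python
def to_position_tokens(n: int, mark_units: bool = False) -> str:
    """Rewrite a non-negative integer with an explicit power-of-ten token
    after each digit, e.g. 32 -> "3 10e1 2".

    mark_units (an assumed option) also appends "10e0" after the last digit.
    """
    if n < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    digits = str(n)
    tokens = []
    for index, digit in enumerate(digits):
        power = len(digits) - 1 - index  # place value of this digit
        tokens.append(digit)
        if power > 0 or mark_units:
            tokens.append(f"10e{power}")
    return " ".join(tokens)


print(to_position_tokens(32))
print(to_position_tokens(832, mark_units=True))
```

With this representation the model can read a digit's position directly from the neighbouring token instead of counting from the right end of the number.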
But we are in partial agreement. I don't actually think transformers are great, I think they're awfully limited, but the fact that mere pattern-matching can achieve so much makes me highly optimistic about better methods, e.g. adding working memory.
[1] Investigating the Limitations of Transformers with Simple Arithmetic Tasks https://arxiv.org/abs/2102.13019