| > Would you say that GPT-4 can reason now? Let's assume reasoning entails going beyond the stochastic parrot level. Can LLMs have skills not demonstrated in the training set? Here is a paper demonstrating that GPT-4 can combine up to 5 skills from a set of 100, effectively covering 100^5 tuples of skills, while seeing far fewer combinations in training on a specific topic. > simple probability calculations indicate that GPT-4's reasonable performance on k=5 is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training
https://arxiv.org/abs/2310.17567 So they show an ability to freely combine skills, and the limit of k=5 measured in this benchmark illustrates that models do generalize. They are able to apply skills in new combinations correctly, but there is also a limit. The interesting part is how they demonstrate this: for a topic with, say, n=1000 samples in the training set, it is impossible to have enough training examples to cover tuples of 5 skills, yet models (mostly GPT-4) can handle it. Other models top out at tuples of only 2 or 3 skills. Models combining skills in new ways are not just parroting. They can perform meaningful work outside their training distribution. |
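For anyone who wants to sanity-check the counting argument in the quote, here is a rough sketch. The numbers (100 skills, k=5, n=1000 samples) come from the comment above; the function name and the assumption that each training sample demonstrates at most one k-tuple of skills are mine, not the paper's exact setup.

```python
from math import comb


def coverage_upper_bound(num_skills, k, num_samples, ordered=True):
    """Upper bound on the fraction of k-skill tuples a corpus could cover.

    Assumption (not from the paper): each training sample demonstrates
    at most one k-tuple of skills, so coverage <= num_samples / total.
    """
    # Ordered tuples with repetition: N^k. Unordered distinct sets: C(N, k).
    total = num_skills ** k if ordered else comb(num_skills, k)
    return min(1.0, num_samples / total)


# Figures quoted in the comment: 100 skills, tuples of 5, 1000 samples.
ordered_bound = coverage_upper_bound(100, 5, 1000)
unordered_bound = coverage_upper_bound(100, 5, 1000, ordered=False)

print(f"ordered tuples (100^5):     {ordered_bound:.2e}")
print(f"unordered sets (C(100, 5)): {unordered_bound:.2e}")
```

Under either way of counting, 1000 samples can touch only a tiny fraction of the possible combinations, which is the gist of the "not just parroting" argument.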
I have a hunch these models are approximating an important subset of what we call reasoning. In dangerously reductive terms, it's a question of how closely we can approximate a function's output, and how much of that output we can cover.
There was at least one paper[1] showing similarities between AI models and the hippocampus. That lines up with another part of human neuroscience: at least part of human reasoning appears to take place inside the hippocampus itself[2].
From my neuroscience background, the takeaways seem to be:
* Carmack is right: we're missing some important bridging concepts for AGI.
* Whether current LLMs can reason depends on how you define reasoning.
I'm unsure whether finding answers in those areas would be a good thing. Instead of alignment issues or misuse, I'm more worried about how quickly people would overreact to it. We might already be seeing that in business.
1. https://arxiv.org/abs/2103.07356
2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3312239