I think one main failure in the framing of these papers (and of LLM discussion more broadly) is that the abstract says GPT-4 ‘struggles’ with logical reasoning:

> ChatGPT and GPT-4 do relatively well on well-known datasets […] however, the performance drops significantly when handling newly released and out-of-distribution [where] Logical reasoning remains challenging for ChatGPT and GPT-4

But reading the paper, the challenges it fails on are ones that I wager the average human would fail on too (at least a good portion of the time). The paper might be strictly accurate, but I think we should try to bring these papers back to a real-world context, which is that it’s probably operating above your average human at these tasks. Is superhuman/genius-level capability really required before we say LLMs are any good?

(I see this view on HN too: statements like ‘LLMs can’t create novel maths theorems!’ offered as an argument that LLMs aren’t good at reasoning, disregarding that most humans today can’t find novel/undiscovered maths theorems either.)
However, there's one VERY important thing to note: this is not a system that was designed to reason! At all, as far as I know. That ability just fell out of its facility with language somehow. So to accidentally be able to reason like a 4-year-old human (who is vastly cleverer than the adult of any other animal species I'm aware of) is incredibly impressive. I think the obvious next step is to couple this tech with classic computing, which has far exceeded human capabilities for logic and reasoning for decades already. If ChatGPT had some secondary system for reasoning and just used the LLM for setting up problems and reading back results, I think it could reach superhuman levels of reasoning quite easily.
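To make the "LLM sets up the problem, a classic system does the reasoning" idea concrete, here's a rough sketch of what I mean. It is not how ChatGPT or any real product works. The `stub_llm_formalize` function is a hypothetical stand-in that returns a hard-coded formalization where a real system would call a model, and the tuple-based formula encoding is just something I made up for illustration. The only real reasoning happens in `entails`, which is a plain brute-force truth-table check.

```python
from itertools import product

# Formulas are nested tuples (an encoding assumed for this sketch):
# ("var", "p"), ("not", f), ("and", f, g), ("or", f, g), ("implies", f, g)

def evaluate(formula, assignment):
    """Evaluate a formula under a dict mapping variable names to booleans."""
    op = formula[0]
    if op == "var":
        return assignment[formula[1]]
    if op == "not":
        return not evaluate(formula[1], assignment)
    if op == "and":
        return evaluate(formula[1], assignment) and evaluate(formula[2], assignment)
    if op == "or":
        return evaluate(formula[1], assignment) or evaluate(formula[2], assignment)
    if op == "implies":
        return (not evaluate(formula[1], assignment)) or evaluate(formula[2], assignment)
    raise ValueError(f"unknown operator: {op}")

def variables(formula):
    """Collect the variable names that appear in a formula."""
    if formula[0] == "var":
        return {formula[1]}
    return set().union(*(variables(sub) for sub in formula[1:]))

def entails(premises, conclusion):
    """Classic part: check every truth assignment for a counterexample."""
    names = sorted(set().union(variables(conclusion),
                               *(variables(p) for p in premises)))
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if all(evaluate(p, assignment) for p in premises) \
                and not evaluate(conclusion, assignment):
            return False  # premises hold but the conclusion does not
    return True

def stub_llm_formalize(question):
    """Hypothetical stand-in for the LLM step; ignores its input and
    returns a fixed formalization of 'if it rains the ground is wet;
    it rains; is the ground wet?'."""
    rain, wet = ("var", "rain"), ("var", "wet")
    return [("implies", rain, wet), rain], wet

def answer(question):
    """Glue: the 'LLM' sets up the problem, the solver decides it."""
    premises, conclusion = stub_llm_formalize(question)
    return "yes" if entails(premises, conclusion) else "not necessarily"
```

The point of splitting it this way is that the language model never has to be right about the logic itself, only about the translation. Truth tables blow up exponentially with the number of variables, so anything serious would swap in a proper solver, but the division of labour would be the same.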