It failed every logic puzzle I gave it with slight tweaks, including stupid Monty Hall (with transparent doors). It BSs with confidence.
AGI is not knocking at the door.
Prove that there are no non-negative numbers less than 3
It bullshits an answer with confidence (all LLMs do this).
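The claim in that prompt is simply false, so the sensible response is a counterexample rather than a "proof". A minimal Python sketch, assuming "numbers" means integers (the helper name `nonnegative_below` is made up for illustration):

```python
def nonnegative_below(limit):
    """Return every non-negative integer strictly less than limit."""
    return [n for n in range(limit) if n >= 0]

# Any element of this list refutes "there are no non-negative numbers less than 3".
counterexamples = nonnegative_below(3)
print(counterexamples)  # [0, 1, 2]
```

Any one of those values is enough to show the statement can't be proven.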
stupid Monty Hall
Suppose you're on a game show, and you're given the choice of three transparent doors...
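To see why the canned "always switch" answer is wrong here, a rough simulation helps. The puzzle text above is cut off, so this sketch assumes the rest: the player can see the car through the transparent doors and picks it, then the host opens a goat door and offers a switch. The function name `play` and its parameters are hypothetical.

```python
import random

def play(switch, trials=10_000, seed=0):
    """Fraction of games won in the assumed transparent-door variant."""
    rng = random.Random(seed)
    doors = [0, 1, 2]
    wins = 0
    for _ in range(trials):
        car = rng.choice(doors)
        # Assumption: the doors are transparent, so the player sees the car and picks it.
        pick = car
        # The host opens a door that is neither the player's pick nor the car (a goat).
        opened = rng.choice([d for d in doors if d != pick and d != car])
        if switch:
            # Switching means moving to the one remaining closed door.
            pick = next(d for d in doors if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print("stay:  ", play(switch=False))
print("switch:", play(switch=True))
```

Under these assumptions staying wins every game and switching loses every game, the opposite of the classic answer.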
stupid river crossing
A farmer with a wolf, a goat, and a koala must cross a river by boat....
Basically, these LLMs have ingested canned solutions and can't reason with newly defined concepts. Give them anything "out-of-the-box" and they BS canned answers, like the rote student. The BS is particularly distasteful because of the confidence projected in the answer...
So, they are great for looking up commonly understood "in-the-box" narratives, but are poor at reasoning where there is some novelty. This is what we can expect from a probabilistic "deep" autocompleting machine. A child, by contrast, can learn ideas and metaphors from a few examples and anomalies.
Plowing through a problem you've seen many times and memorized, without "concentrating" enough to notice the subtle differences, is a failure mode that occurs in humans as well. We don't say "humans can't reason" just because this happens, so it makes little sense to say the same for LLMs. The important bit is that the model can solve the problem once it is nudged away from the memorized answer, same as people.
Humans are fundamentally wired to be irrational - our perceptual/cognitive apparatus is deeply flawed - umpteen studies show this - so this is a given.
But we also discovered a way to think/model that seems to work amazingly well: the scientific method, or reasoning. This language is not at all natural to the way humans operate, though. It is a struggle for most of us to think in that manner.
That's why math/science is difficult for most of us, and these were discovered only in the last 2000 years.
LLMs cannot yet represent conceptual relationships deterministically/symbolically. At some point in the future, perhaps they can, but the current generation has a long way to go.