by ninetyninenine 498 days ago
It’s stupid. You can prove that LLMs can reason by simply giving it a novel problem where no data exists and having it solve that problem.

LLMs CAN reason. Whether it can’t reason is not provable. To prove that you have to give the LLM every possible prompt that it has no data for and effectively show it never reasons and gets it wrong all the time. Not only is the proof impossible but it’s already been falsified as we have demonstrable examples of LLMs reasoning.

Literally I invite people to post prompts and correct answers to ChatGPT where it is trivially impossible for that prompt to exist in the data. Every one of those examples falsifies the claim that LLMs can’t reason.

Saying LLMs can’t reason is an overarching claim similar to the claim that humans and LLMs always reason. Humans and LLMs don’t always reason. But they can reason.

6 comments

Saying something again does not provide proof of its actual veracity. Writing it in caps does not make it true despite the increased emphasis. I default to skepticism in the face of unproven assertions: if one can’t prove that they reason then we must accept the possibility that they do not. There are myriad examples of these models failing to “reason” about something that would be trivial for a child or any other human (some are even given as examples in this post’s other comments). Given this and the lack of concrete proof I currently tend to agree with the Apple researchers’ conclusion.
Here was my test at ChatGPT 3.5.[0] I made up a novel game, and it figured it out. The test is simple, but it made me doubt absolute arguments that LLMs are not able to reason, in some way.

There is a question at the end of that comment, would love to hear other options.

[0] https://news.ycombinator.com/item?id=35442147

How does this prove reasoning? The thread you point to has several questions in it, still unanswered, that ask the same thing. How is this not entirely derivative too? There’s a huge number of these kinds of 3-box “games” (although I don’t see this as a game really), so something very similar to this is probably in the training data a lot. Writing code to factor a number is definitely very common. Variations of this are also very common interview questions for interns (at least when I was interviewing).
Here is GPT 4o in "reasoning" mode answering the question "Are LLMs capable of reasoning?"

> It depends on how you define "reasoning."

> LLMs like me can perform certain types of reasoning, such as:

> Pattern recognition & analogy – Recognizing relationships between concepts and applying similar patterns in new contexts.

> Deductive reasoning – Applying rules to specific cases (e.g., "If all humans are mortal and Socrates is a human, then Socrates is mortal").

> Inductive reasoning – Generalizing from specific examples (though often with statistical biases).

> Common-sense reasoning (to an extent) – Inferring likely outcomes based on broad training data.

> However, there are limitations:

> No true understanding – LLMs don’t "know" things like humans do; they predict based on probabilities.

> Struggles with multi-step logic – Complex reasoning that requires deep memory or long-term planning is difficult.

> Lack of real-world interaction – Without direct experience, reasoning is limited to text-based patterns.

> Confidence: 90%

> (Strong evidence suggests LLMs can perform certain types of reasoning, but they are not yet comparable to human-level reasoning.)

Would you agree with that analysis? If so, then LLMs are indeed capable of reasoning, in some ways.

It fails at deductive reasoning though. Pick a celebrity with non-famous children that don't obviously share their last name or something. If you ask it "who is the child of <celebrity>", it will get it right, because this is in its training data, probably Wikipedia.

If you ask "who is the parent of <celebrity-child-name>", it will often claim to have no knowledge about this person.

Yes sometimes it gets it right, but sometimes also not. Try a few celebrities.

Maybe the disagreement is about this?

Like if it gets it right a good amount of the time, you would say that means it's (in principle) capable of reasoning.

But I say, that if it gets it wrong a lot of the time, that means 1) it's not reasoning in situations when it gets it wrong, but also 2) it's most likely also not reasoning in situations when it gets it right.

And maybe you disagree with that, but then we don't agree on what "reasoning" means. Because I think that consistency is an important property of reasoning.

I think that if it gets "A is parent of B, implies B is child of A" wrong for some celebrity parents, but not for others, then it's not reasoning. Because reasoning would mean applying this logical construct as a rule, and if it's not consistent at that, it makes it hard to argue that it is in fact applying this logical rule instead of doing who-knows-what that happens to give the right answer, some of the time.
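
The consistency test described above can be sketched as a small harness. This is only an illustration: `ask` stands in for whatever function sends a prompt to a model and returns its text, and `toy_model` and the "Parent A"/"Child A" names are hypothetical placeholders, not real data or a real API.

```python
from typing import Callable, Dict, Iterable, Tuple


def reversal_consistency(
    ask: Callable[[str], str],
    pairs: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Count how often a model answers both directions of a parent/child fact."""
    counts = {"both": 0, "forward_only": 0, "reverse_only": 0, "neither": 0}
    for parent, child in pairs:
        # Forward direction: parent -> child.
        forward_ok = child.lower() in ask(f"Who is the child of {parent}?").lower()
        # Reverse direction: child -> parent.
        reverse_ok = parent.lower() in ask(f"Who is the parent of {child}?").lower()
        if forward_ok and reverse_ok:
            counts["both"] += 1
        elif forward_ok:
            counts["forward_only"] += 1
        elif reverse_ok:
            counts["reverse_only"] += 1
        else:
            counts["neither"] += 1
    return counts


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: a lookup table with one missing reversal."""
    memorized = {
        "Who is the child of Parent A?": "Child A",
        "Who is the parent of Child A?": "Parent A",
        "Who is the child of Parent B?": "Child B",
    }
    return memorized.get(prompt, "I have no knowledge of this person.")


if __name__ == "__main__":
    pairs = [("Parent A", "Child A"), ("Parent B", "Child B")]
    print(reversal_consistency(toy_model, pairs))
```

A high `forward_only` count is the pattern I'm describing: the fact is retrievable in one direction but the rule "A is parent of B implies B is child of A" isn't being applied.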

I was unable to find my exact "game" in google's index.

Therefore, how does my example not qualify as this, at least:

> Analogical reasoning involves the comparison of two systems in relation to their similarity. It starts from information about one system and infers information about another system based on the resemblance between the two systems.

https://en.wikipedia.org/wiki/Logical_reasoning#Analogical

Is it actually reasoning though or just pattern matching? Seems like to compare, one should also “know”, which your response above indicates they do not.

I guess the real question is “does moving down a stochastic gradient of probabilities suffice as reasoning to you” and my answer is no, because you don’t need reason to find the nearest neighbor in this architecture. In this case the model is not actively comparing and inferring; it’s simply associating without “knowing”.

There are many types of reasoning, and LLMs appear to do some of them.
My thread has been voted down and it’s getting stale. The few remaining people are biased towards their point of view and are unlikely to entertain anything that will trigger a change in their established world view.

Most people will use this excuse to avoid responding to or even looking at your link here. It is compelling evidence.

I’d settle for these things being able to do value comparison consistently well, play a game of tic tac toe more than once correctly, or use a UI after an update and not fail horrendously, to move the needle a little bit for me. People claiming these things selectively reason while also not being able to explain why seems a lot like magical thinking to me, rather than entertaining the possibility you might be projecting onto something that is really damn-well engineered to make you anthropomorphize it.
I can prove LLMs can reason. You cannot prove LLMs can't reason. This is easily demonstrable. LLMs failing to reason is not proof LLMs can't reason, it's just proof that an LLM didn't reason for that prompt.

All I have to do is show you one prompt with a correct answer that cannot be arrived at with pattern matching and the prompt can only be arrived at through reasoning. One. You have to demonstrate this for EVERY prompt if you want to prove LLMs can't reason.

No, I can “prove” it: look at any number of cases where LLMs can’t even do basic value comparisons despite being claimed as super intelligent. You can try and say well, that’s a limitation of the technology, and then I would reply: yes, and that’s why I would say it’s not reasoning according to the original human definition. Also you have yet to produce any evidence of reasoning, and claiming you can over and over again doesn’t add to your argument’s substance. I would be interested in your proof that some answer can’t be pattern matched too. At this point I wonder if we could create a non-conscious “intelligence” that, if large enough, would be mostly able to describe anything known to us along some line of probability we couldn’t compute with our brain architecture, and it could be close to 99.99999% right. Even if we had this theoretical probability-based super intelligence it still wouldn’t be “reasoning” but could be more “intelligent” than us.

I’m also not entirely convinced we can’t arrive at a reasoning system via probability only (a really cool thought experiment) but these systems do not meet the consistency/intelligence bar for me to believe this currently.

LLMs can reason, they just don’t always reason.

That’s the claim everyone makes. That is a human definition if it reasoned one time correctly. That is the colloquial definition.

Someone who has brain damage can reason correctly on certain subjects and incorrectly on other subjects. This is an immensely reasonable definition. I’m not being pedantic or out of line here when I say LLMs can reason while using this definition.

Nobody is making the claim that LLMs reason like humans or are human or reason perfectly every time. Again the claim is: LLMs are capable of reasoning.

No, reasoning is about applying rules of logic consistently, so if you only do it some of the time, that's not reasoning.

If I roll a die and only _sometimes_ it returns the correct answer to a basic arithmetic question, this is the exact reason why we don't say a die is doing arithmetic.

Even worse in the case of LLMs, where it's not caused by pure chance, but also training bias and hallucinations.

You can claim nobody knows the exact definition of reasoning, maybe there are some edges which aren't clearly defined because they're part of Philosophy, but applying rules of logic consistently is not something you just don't always do and still call it reasoning.

Also, LLMs are generally incapable of saying they don't know something, cannot know something, can't do something, etc. They would rather try and hallucinate. When it does that, it's not reasoning. And you also can't explain to an LLM how to figure out it doesn't know something, and then actually say it doesn't know and not make stuff up. If it was capable of reasoning you should be able to convince it using _reason_, to do exactly that.

However, you

I still think the jury is out on this given that they seem to fail on obvious things which are trivially reasoned about by humans. Perhaps they reason differently, at which point I would need to understand how this reasoning is different from a human’s reasoning (perhaps biological reasoning more generally?) and then I would want to consider whether one ought to call it reasoning given its differences (if there are any at the time of sampling). I understand your claim; I’m just not buying it based on the current evidence and my interacting with these supposed “super intelligences” every day. I still find these tools valuable, just unable to “reason” about a concept, which makes me think, as powerful and meaning-filled as language is, our assumption of reasoning might just be a trick of our brain reasoning through a more tightly controlled stochastic space and us projecting the concept of reasoning onto a system. I see the CoT models contort and twist language in a simulacrum of “reasoning”, but any high school English teacher can tell you there is a lot of text written that appears to logically reason but doesn’t actually do anything of the sort once read with the requisite knowledge in the subject matter.
They can fail at reasoning. But they can demonstrably succeed too.

So the statement that they CAN reason is demonstrably true.

Ok if given a prompt where the solution can only be arrived at by reasoning and the LLM gets to the solution for that single prompt, then how can you say it can't reason?

Just say it: LLMs are random machines. Even a broken clock is right twice a day.
Answering novel prompts isn't proof of reasoning, only pattern matching. A calculator can answer prompts it's never seen before too. If anything, I would come down on the reasoning side, at least for recent CoT models, but it's not a trivial question at all.
This is a fun thought experiment and made me reminisce on my Epistemology classes — something I think the current AI conversation would benefit greatly from. I’m super excited about what we’ve created here — less from the practical standpoint and more from a philosophical one where we get to interact with another form of distilled knowledge. It’s really too bad so much is breathless hype and grift because the philosophy student in me just wants to bask in thinking about this different form/medium/distillation of knowledge we now get to interact with. Comments like these help to reinvigorate that love though so thank you!
Are there any good Epistemology resources online? Seems like we could all benefit from this these days.
I actually just sat down to crack open MIT’s Theory of Knowledge and it seems promising and free: https://ocw.mit.edu/courses/24-211-theory-of-knowledge-sprin...

This also looks promising:

https://hiw.kuleuven.be/en/study/prospective/OOCP/introducti...

If you wanted something a bit different Wittgenstein’s Tractatus has always made my head spin with possibilities:

https://people.umass.edu/klement/tlp/tlp-hyperlinked.html

Then I'll come up with a prompt such that the answer can only be arrived at via reasoning. I only have to demonstrate this once to prove LLMs CAN reason.
I don’t think this is the watertight case you think it is, furthermore good luck proving with closed models that your question that’s never been asked in any form or derivation (supposedly) is not in the training data.
It’s watertight if the claim is only LLMs CAN reason.

No one is making the claim that LLMs reason like humans or always reason correctly. Ask anyone who makes a claim similar to mine. We are all ONLY making the claim that LLMs can reason correctly. That is a small claim.

The counterclaim is LLMs can’t reason and that is a vastly expansive claim that is ludicrously unprovable.

> Then I'll come up with a prompt such that the answer can only be arrived at via reasoning.

Dude, if you can formulate a question and prove an answer absolutely requires "reasoning" (defined how?) then you should drop everything and publish a paper on it immediately.

You'll have plenty of time to use your discovery to poke at LLMs after you secure your worldwide fame and recognition.

Go ahead then.
This is the count donut problem. Given a grid of 1s and 0s where 1 represents land and 0 represents water, find the number of donuts. A donut is an island with at least one hole in it. Two land cells that are diagonal or adjacent form a barrier that water cannot cross. Count the number of donuts in the grid.

This is a unique problem I came up with. It’s a variation on counting islands. There are actually two correct answers that are straightforward. Other answers may exist but are generally not straightforward and often wrong. One answer is mathematical the other is a leetcode style solution.

Try to solve this yourself before using AI to get a feel for how hard it is. The solution should be extremely straightforward. It’s also fun to think about. When you try to think of a solution you will invariably come up with a bunch of possible solutions that are wrong, which is a strong indicator of how large the range of possible answers is. Few answers are correct but many look correct.

I give this test to candidates and I never expect the candidate to solve it because it’s one of the few algorithms that requires actual reasoning and actual creativity as I came up with it so no variation of it really exists anywhere else. You can’t pattern match for it. Out of like 50 candidates you probably get one person able to solve it in less than an hour.

It’s unlikely most people on hn will be able to solve it. If you do solve it don’t post the answer as it will become training data for the next iteration of the LLM.

I gave the prompt to o3. It solved it. It generated code as well, which I was too lazy to verify, but it solved it correctly in the description of the algorithm involved.

There is also a 3D version of this problem where the grid is 3D. It changes the entire problem if a donut is in 3D space. It is harder and I have only found one possible solution for it. I have not tried it on an LLM.
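
For readers who want the 2D problem statement pinned down, here is a brute-force sketch under one reading of it. The assumptions are mine, not the poster's: land cells join into islands through all 8 neighbors (including diagonals), water flows only through the 4 side neighbors, everything outside the grid is water, and an island is a donut if it encloses at least one cell that outside water cannot reach. The function names are hypothetical, and this is not claimed to be either of the two intended solutions mentioned above.

```python
from collections import deque

EIGHT = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def land_components(grid):
    """Group land cells (1s) into islands using 8-connectivity."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    islands = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or (r, c) in seen:
                continue
            island = {(r, c)}
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in EIGHT:
                    nr, nc = cr + dr, cc + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and grid[nr][nc] == 1 and (nr, nc) not in seen:
                        seen.add((nr, nc))
                        island.add((nr, nc))
                        queue.append((nr, nc))
            islands.append(island)
    return islands


def has_hole(island, rows, cols):
    """True if this island alone encloses at least one cell.

    Floods from outside the grid (a one-cell frame of padding) with
    4-connected moves over every cell that is not part of the island.
    """
    start = (-1, -1)
    reached = {start}
    queue = deque([start])
    while queue:
        cr, cc = queue.popleft()
        for dr, dc in FOUR:
            nxt = (cr + dr, cc + dc)
            in_frame = -1 <= nxt[0] <= rows and -1 <= nxt[1] <= cols
            if in_frame and nxt not in island and nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    total_cells = (rows + 2) * (cols + 2)
    # Any cell neither reached nor part of the island is enclosed by it.
    return len(reached) + len(island) < total_cells


def count_donuts(grid):
    """Count islands that enclose at least one hole."""
    if not grid or not grid[0]:
        return 0
    rows, cols = len(grid), len(grid[0])
    return sum(has_hole(island, rows, cols) for island in land_components(grid))
```

This re-floods the whole grid once per island, so it is slow on large grids; it is meant to fix the definitions, not to be efficient.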

LLMs CAN read minds. Whether it can’t read minds is not provable.

Literally I invite people to post prompts and correct answers to ChatGPT where it is trivially impossible for it to have known what number you were thinking of. Every one of those examples falsifies the claim that LLMs can’t read minds.

ok prove it. I'm thinking of a number right now between 1-10,000. Show me the number the LLM guesses. You can definitively prove this statement for me.

It's a probability problem really. The range of a prompt has billions of possibilities. If it arrived at a correct answer within that range then the probability it got there without reasoning is minuscule.

Same with this mind reading thing. Prove it.

Doesn't really seem fair that any one prompt proves your conclusion but it has to guess your exact number to prove my conclusion. Gemini guessed mine on the very first try (7) even though the range of numbers is infinite. Billions is small potatoes compared to what I've proven.
I’ll pick a prompt such that the range is vast so that if it gets the answer right the probability is so small that it must have arrived there by reasoning.
> You can prove that LLMs can reason by simply giving it a novel problem where no data exists and having it solve that problem

They scan a hyperdimensional problem space whose facetness and capacity a single human is unable to comprehend. But there potentially exists a slice that corresponds to a problem that is novel to a human. LLMs are completely alien to us both in capabilities and technicalities, so talking about whether they can reason makes as much sense as if you replaced “LLMs” with “rainforests” or “Antarctica”.

Reasoning is an abstract term. It doesn’t need to be similar to human reasoning. It just needs to be able to arrive at the answer through a process.

Clearly we used the term reasoning for many varied techniques. The term doesn’t narrow to specifically one form of “human” like reasoning only.

Oh, that is true. "It" doesn't have to do human reasoning, at all.

But we have to at least define "reasoning" for the given manifestation of "it". Otherwise it's just birdspeak. Because reasoning is "the action of thinking about something in a logical, sensible way", which has to happen somewhere if not finger-pointable, then at least somehow scannable or otherwise introspectable. Otherwise it's yet another omnidude in the sky who made it all so that you cannot see him, but there will be hints if you believe.

Anyway, we have to talk about something specific, not handwavy. Even if you prove that they CAN reason for some definition of it, both the proof and the definition must have some predictive/scientific power, otherwise they are as useless as nil thought about it.

For example, if you prove that the reasoning is somehow embedded as a spatial in-network set of dimensions rather than in-time, wouldn't that be literally equivalent to "it just knows the patterns"? What would that term substitution actually achieve?

Well no. If you create a machine that produces output indistinguishable from the output of things we "know" can "reason" aka "humans". Then I would call that reasoning.

If the output has a low probability of occurring by random chance then it must be reason.

>For example, if you prove that the reasoning is somehow embedded as a spatial in-network set of dimensions rather than in-time, wouldn't that be literally equivalent to "it just knows the patterns"? What would that term substitution actually achieve?

I mean, this is a method many humans use to reason themselves.

A side effect of this is that a zip.exe that unzips a zip into a book that contains text indistinguishable from the output of a human must reason too.

From what I can see, you’re only massaging semantics. That is uninteresting.

No. I clearly said it must output novel things that aren’t part of the input.

In your example the book is the training data or aka the input.

> But they can reason

This isn't demonstrated yet, I would say. A good analogy is how people have used NeRFs to generate Doom levels, but when they do, the levels don't have offscreen coherence or object permanence. There's no internal engine behind the scenes making an actual Doom level. There's just a mechanism to generate things that look like outputs of that engine. In the same way, an LLM might well just be an empty shell that's good at generating outputs based on similar-looking outputs it was trained on, rather than something that can do the work of thinking about things and producing outputs. I know that's similar to "statistical parrot", but I don't think what you're saying demonstrates anything more than that.

It can be trivially demonstrated with a unique problem that doesn’t exist in the training data and an answer that is correct and has a low probability of being arrived at without reasoning.
wow this is like:

"I made a hypothesis that works with 1 to 5. if a hypothesis holds for 10 numbers, it holds for all numbers"

No. My claim is it can reason. So my claim is along the lines of it can make claims that are within bounds such as 1 to 5 or it can make claims not within those bounds.

The opposing claim is unbounded. It says LLMs can't reason period. They are making the claim that it is 100% for all possible prompts.

No one is making the claim LLMs reason all the time and always. They don't. The claim is that they CAN reason.

Versus the claim that they can't which is all encompassing and ludicrous.

your claim (hypothesis): LLMs can reason

your evidence: "it works with these inputs I tried!"

...hmm seems you're not quite versed in basic mathematical proofs?

Seems you’re not well versed in basic English.

If I can reason it doesn’t mean I’m always reasoning or constantly reasoning or that I know how to reason for every prompt. It just means it’s possible. How narrow or how wide that possibility is, is orthogonal to the claim itself. Please employ logic here.

Ok math guy. Imagine I said numbers can be divided. The claim is true even though there is a number that can’t be divided. Zero.

If it's only reasoning randomly how do you know when anything has been reasoned properly vs just a generated simulation of reasonable text?
We use probability. Find a prompt that has a large range, aka codomain. If it arrived at the correct answer then the only possibility here is reasoning, because the codomain is so large it cannot arrive there by random chance.

Of course make sure the prompt is unique such that it's not in the data and it's not doing any sort of "pattern matching".

So like all science we prove it via probability. Observations match with theory to a statistical degree.
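
The arithmetic behind this can be sketched in a few lines. It assumes the alternative to reasoning is a uniform random guess over `answer_space` equally likely answers, which is a strong assumption: it bounds pure chance only and says nothing about pattern matching on similar training data. The function name and parameters are made up for illustration.

```python
from math import comb


def chance_of_at_least(successes: int, trials: int, answer_space: int) -> float:
    """Probability of getting at least `successes` correct answers in `trials`
    independent prompts by uniform random guessing over `answer_space` answers."""
    p = 1 / answer_space
    # Binomial tail: sum the probabilities of k = successes .. trials hits.
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # One guess at a number between 1 and 10,000.
    print(chance_of_at_least(1, 1, 10_000))
    # Three correct answers out of three prompts, each with a million possible answers.
    print(chance_of_at_least(3, 3, 1_000_000))
```

The bigger the answer space and the more independent prompts, the smaller the number gets, which is the sense in which observations match theory to a statistical degree.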