by josehackernews 757 days ago
how do you argue that these models are not able to reason?

deductive reasoning is just drawing specific conclusions from general patterns. something I would argue these models can do (of course not always, and they are still pretty bad in most cases)

the point i’m trying to make is that sometimes reasoning is overrated and put at the top of the cognitive ladder; sometimes I have seen it compared to self-awareness or stuff like that. I know that you are probably not saying it in this way, just wanted to let it out.

I believe there is fundamental work still to be done, maybe models that are able to draw patterns by comparing experiences, but this kind of work can be useful as it makes us reflect on every step of what these models do, and on how much the learned internal representation can be optimized

6 comments

We have no definition of reasoning that is sufficiently precise to be useful.

But we do have a bunch of benchmark tasks/datasets that test what we intuitively understand to be aspects of reasoning.

For AI models, "being able to reason" means "performing well on these benchmark tasks/datasets".

Over time, we'll add more benchmarking tasks and datasets that ostensibly test aspects of "reasoning", and people will develop models that succeed on more and more of these simultaneously.

And these models will become more and more useful. And people will still argue over whether they are truly "reasoning".

>> deductive reasoning is just drawing specific conclusions from general patterns.

This is according to whom, please?

The fundamental argument of "Artificial Intelligence, Natural Stupidity" is that AI researchers constantly abuse terms like "reasoning," "deduction," "understanding," and so on, deluding others and themselves that their machine is almost as intelligent as a human when it's clearly dumber than a dog. My cats don't need "general patterns" to form deductions, they deduce many sophisticated things (on their terms) with n=1 data points.

In the 80s the computers were indisputably dumber than ants. That's probably not true these days. But the decades-long refusal of most AI researchers to accept humility about the limitations of their knowledge (now they describe multiple-choice science trivia as "graduate level reasoning") suggests to me that none of us will live to see an AI that's smarter than a mouse. There's just too much money and ideology, and too little falsifiability.

Drew McDermott's warning is well taken, but there are established and well-understood definitions of deductive, inductive and abductive reasoning that go back to at least Charles Sanders Peirce (philosopher and pioneer of predicate logic, contemporary of Gottlob Frege) that are widely accepted in AI research, and that even McDermott would have accepted. See sig for intro.

This is completely irrelevant. McDermott's point was that scientifically-plausible definitions of reasoning were not actually being used in practice by AI researchers when they made claims about their systems. That is just as true today.

I've read McDermott's paper a few times (it's a favourite of mine) and I don't remember that angle. Can you please clarify why you say that's his point?

Ants behave in ways that a modern computer still can't imitate. I don't think that generalized intelligence is possible, but if it is, it would need a different starting point than our current computing hardware. Even insects are flexible in ways that computers aren't.

> My cats don't need "general patterns" to form deductions, they deduce many sophisticated things (on their terms) with n=1 data points.

No they don't. That's just generalization, so they've seen plenty of other data points that are similar enough.

> Deductive reasoning is the process of drawing valid inferences. An inference is valid if its conclusion follows logically from its premises, meaning that it is impossible for the premises to be true and the conclusion to be false.

<https://en.wikipedia.org/wiki/Deductive_reasoning>
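For propositional logic, the definition quoted above can be checked mechanically by looking for a truth assignment that makes every premise true and the conclusion false. A minimal sketch, assuming premises and conclusion are given as Python functions over a truth assignment (the helper name `is_valid` is made up for illustration):

```python
from itertools import product

def is_valid(premises, conclusion, variables):
    """Return True if no truth assignment makes every premise true
    while making the conclusion false (propositional validity)."""
    for values in product([True, False], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counterexample
    return True

# Modus ponens: from "p implies q" and "p", infer "q".
modus_ponens = is_valid(
    premises=[lambda e: (not e["p"]) or e["q"], lambda e: e["p"]],
    conclusion=lambda e: e["q"],
    variables=["p", "q"],
)

# Affirming the consequent: from "p implies q" and "q", infer "p".
affirming_consequent = is_valid(
    premises=[lambda e: (not e["p"]) or e["q"], lambda e: e["q"]],
    conclusion=lambda e: e["p"],
    variables=["p", "q"],
)

print(modus_ponens, affirming_consequent)  # True False
```

The second inference fails because p=False, q=True makes both premises true and the conclusion false. Note that nothing in this check involves "general patterns".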

That's not the definition used by the comment above.

> how do you argue that these models are not able to reason?

They just don't have the right architecture to support it.

An LLM is just a fixed size stack of N transformer layers, and has no working memory other than the temporary activations between layers. There are always exactly N steps of "logic" (embedding transformation) put into each word output.

You can use prompts like "think step by step" to try to work around these limitations, so that a complex problem can (with good planning by the model) be broken down into M steps of N layers, with the model's own output in early steps acting as pseudo-memory for later steps, but this only gets you so far. It provides a workaround for the fixed N layers and lack of memory, but creates a critical dependency on the ability to plan and to maintain coherence while manipulating long contexts, both of which are observed weaknesses of LLMs.
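A toy analogy for the workaround described above (not a transformer, just an illustration): a function that may apply at most `n_layers` sequential operations per call, with its own output appended to a scratchpad that later calls can read. All names here are hypothetical.

```python
def fixed_depth_step(scratchpad, pending, n_layers):
    """One 'forward pass': consume at most n_layers items from pending,
    folding them into the running total stored at the end of the scratchpad."""
    total = scratchpad[-1] if scratchpad else 0
    chunk, rest = pending[:n_layers], pending[n_layers:]
    for x in chunk:  # at most n_layers sequential operations
        total += x
    return scratchpad + [total], rest

def solve_step_by_step(numbers, n_layers):
    """Sum a list by chaining fixed-depth passes through a scratchpad."""
    if n_layers < 1:
        raise ValueError("n_layers must be at least 1")
    scratchpad, pending, steps = [], list(numbers), 0
    while pending:
        scratchpad, pending = fixed_depth_step(scratchpad, pending, n_layers)
        steps += 1
    return (scratchpad[-1] if scratchpad else 0), steps

result, steps = solve_step_by_step(range(1, 11), n_layers=3)
print(result, steps)  # 55 4
```

A single pass handles only three additions, so the ten-number sum needs M = 4 passes, and each pass is only right if the previous pass wrote a correct intermediate total.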

Human reasoning/planning isn't a linear process of N steps - in the general case it's more like an iterative/explorative process of what-if prediction/deduction, backtracking etc, requiring working memory and focus on the task. There's a lot more to the architecture of our brain than a stack of layers - a transformer is just not up to the job, nor was it built for it.
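As a contrast with a fixed number of sequential steps, here is a minimal sketch of the what-if/backtrack style of search described above. Subset sum is used as a stand-in problem; that choice, and the function name, are assumptions for illustration only.

```python
def find_subset(numbers, target, chosen=None, start=0):
    """Depth-first search with backtracking for a subset summing to target.
    Assumes non-negative integers, so a branch can be abandoned as soon as
    the remaining target goes negative."""
    if chosen is None:
        chosen = []  # working memory: the choices made so far
    if target == 0:
        return list(chosen)
    if target < 0:
        return None
    for i in range(start, len(numbers)):
        chosen.append(numbers[i])  # what-if: take numbers[i]
        found = find_subset(numbers, target - numbers[i], chosen, i + 1)
        if found is not None:
            return found
        chosen.pop()  # backtrack: undo the choice and try the next one
    return None

print(find_subset([8, 6, 7, 5, 3], 15))  # [8, 7]
```

The number of steps is not fixed in advance: the search first tries 8 + 6, finds that no remaining number completes it, undoes the 6, and only then reaches 8 + 7.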

It is not «deductive reasoning»: it is just "reasoning". That is, revising a body of ideas for qualities pertinent to alethic (truthfulness) and understanding (completeness).

It is critical thinking, continuous cycles of reprocessing.

And this cannot be overrated: it is the core activity.

> how do you argue that these models are not able to reason?

I don't make this argument. Benchmarks like CLUTRR[1] show how poorly LLMs do in reasoning.

[1] https://github.com/facebookresearch/clutrr

There is a difference between poor reasoning and no reasoning. SOTA LLMs correctly answer a significant number of these questions. The likelihood of doing so without reasoning is astronomically small.
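The size of "astronomically small" can be sketched with a binomial tail, under the strong assumption that "without reasoning" means independent uniform guessing among a fixed number of candidate answers. The figures below (100 questions, 20 candidate answers, 40 correct) are hypothetical and are not taken from CLUTRR results.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more successes."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical numbers: 100 questions, 1-in-20 chance per guess, 40 correct.
print(prob_at_least(40, 100, 1 / 20))
```

Memorization or shallow pattern matching is not uniform guessing, so a calculation like this only rules out the pure-chance explanation.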

Reasoning in general is not a binary or global property. You aren't surprised when high-schoolers don't, after having learned how to draw 2D shapes, immediately go on to draw 200D hypercubes.

Granting that, the original point was that they're not excited about this particular paper unless (for example) it improves the networks' general reasoning abilities.

The problem was never "my llm can't do addition" - it can write python code!

The problem is "my llm can't solve hard problems that require reasoning"

>deductive reasoning is just drawing specific conclusions from general patterns. something I would argue these models can do

That the models can't see a corpus of 1-5 digit addition then generalise that out to n-digit addition is an indicator that their reasoning capacities are very poor and inefficient.

Young children take a single textbook & couple of days worth of tuition to achieve generalised understanding of addition. Models train for the equivalent of hundreds of years, across (nearly) the totality of human achievement in mathematics, and struggle with 10-digit addition.

This is not suggestive of an underlying capacity to draw conclusions from general patterns.

> Young children take a single textbook & couple of days worth of tuition to achieve generalised understanding of addition

Maybe you did! Most young children cannot actually do bigint arithmetic reliably or at all after a couple days worth of tuition!

I think the “train for hundreds of years” argument is misleading. It’s based on parallel compute time and how long it would take to run the same training sequentially on a single GPU. This assumes an equivalence with human thought based on the model’s tokens-per-second rate, which is a bad measurement because it varies with hardware. The closest comparison you could draw to what a human brain is doing would be the act of writing or speaking, but we obviously process a lot more information, and produce information at a much higher rate, than we can speak or write. Imagine if you had to verbally direct each motion of your body: it would take an absurd amount of time to do anything, depending on the specificity you had to work with.

The work done in this paper is very interesting and your dismissal of “it can’t see a corpus and then generalize to n digits” is not called for. They are training models from scratch in 24 hours per model using only 20 million samples. It’s hard to equate that to an activity a single human could do. It’s as though you had piles of accounting ledgers filled with sums and no other information or knowledge of mathematics, numbers or the world and you discovered how to do addition based on that information alone. There is no textbook or tutor helping them do this either it should be noted.

There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.
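A sketch of how that kind of length generalization could be measured, assuming a hypothetical `add_fn(a, b)` that wraps whatever model is under test and returns its answer as a string. This is not the paper's evaluation code, and `brittle_adder` is only a stand-in.

```python
import random

def random_operand(n_digits, rng):
    """Uniform random integer with exactly n_digits decimal digits."""
    low = 10 ** (n_digits - 1) if n_digits > 1 else 0
    return rng.randint(low, 10 ** n_digits - 1)

def exact_match_accuracy(add_fn, n_digits, trials=200, seed=0):
    """Fraction of random n-digit sums that add_fn gets exactly right."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        a, b = random_operand(n_digits, rng), random_operand(n_digits, rng)
        if add_fn(a, b) == str(a + b):
            correct += 1
    return correct / trials

# Stand-in "model" that is only right up to 20-digit operands (hypothetical).
def brittle_adder(a, b):
    return str(a + b) if len(str(a)) <= 20 else "0"

for n in (10, 20, 120):
    print(n, exact_match_accuracy(brittle_adder, n))
```

Comparing accuracy at the training length with accuracy well beyond it is what separates "learned something that extends" from "fits the training range".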

>There is no textbook or tutor helping them do this either it should be noted.

For this particular paper there isn't, but all of the large frontier models do have textbooks (we can assume they have almost all modern textbooks). They also have formal proofs of addition in Principia Mathematica, alongside nearly every math paper ever produced. And still, they demonstrate an incapacity to deal with relatively trivial addition - even though they can give you a step-by-step breakdown of how to correctly perform that addition with the columnar-addition approach. This juxtaposition seems transparently at odds with the idea of an underlying understanding & deductive reasoning in this context.
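For reference, the columnar method mentioned above is short to write down and does not depend on operand length. A minimal sketch on decimal strings, assuming non-negative inputs (the function name is illustrative):

```python
def columnar_add(a: str, b: str) -> str:
    """Add two non-negative decimal strings digit by digit, right to left,
    carrying into the next column, as in schoolbook addition."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter operand with zeros
    digits, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

x = "9" * 120
print(columnar_add(x, "1") == "1" + "0" * 120)  # True
```

The same loop handles 3 digits or 120 digits, including a carry that ripples through every column.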

>There is a form of generalization if it can derive an algorithm based on a maximum length of 20 digit operands that also works for 120 digits. Is it the same algorithm we use by limiting ourselves to adding two digits at a time? Probably not but it may emulate some of what we are doing.

The paper is technically interesting, but I think it's reasonable to definitively conclude the model had not created an algorithm that is remotely as effective as columnar addition. If it had, it would be able to perform addition on n-size integers. Instead it has produced a relatively predictable result: when given lots of domain-specific problems, transformers get better at approximating the results of those problems, and when faced with problems significantly beyond their training data, their accuracy degrades.

That's not a useless result. But it's not the deductive reasoning that was being discussed in the thread - at least if you add the (relatively uncontroversial) caveat that deductive reasoning should lead to correct conclusions.