by Closi 1171 days ago
I think one main failure in the framing of these papers (and discussion of LLMs more broadly) is that the abstract says that GPT4 ‘struggles’ with logical reasoning:

> ChatGPT and GPT-4 do relatively well on well-known datasets […] however, the performance drops significantly when handling newly released and out-of-distribution [where] Logical reasoning remains challenging for ChatGPT and GPT-4

But reading the paper, the challenges it is failing on are ones that I wager the average human would fail on too (at least a good portion of the time).

The paper might strictly be accurate, but I think we should try and bring these papers back to a real-world context - which is that it’s probably operating above your average human at these tasks.

Is superhuman/genius-level capability really required before we say the LLMs are any good?

(I see this view on HN too - statements like ‘LLMs can’t create novel maths theorems!’ as an argument that LLMs aren’t good at reasoning, disregarding that most humans today can’t find novel/undiscovered maths theorems)

4 comments

If you really force it to reason, rather than regurgitate arguments from its training set, you will find it is nowhere near the genius line. Make up some rules and have it try to answer questions according to the rules. In my experiments I feel it's something like a 4 or 5 year old child both in its logical limitations and penchant for distraction.

However it's important to note one VERY important thing -- this is not a system that is designed to reason! At all, as far as I know. That just fell out of its ability for language somehow. So to just accidentally be able to reason like a 4 year old human (which are vastly clever compared to the adult of any other animal species I'm aware of) is incredibly impressive and I think the next obvious step is to couple this tech together with some classic computing, which has far exceeded human capabilities for logic and reason for decades already. If ChatGPT has some secondary system for reasoning and just uses the LLM for setting up problems and reading results, I think it could reach superhuman levels of reasoning quite easily.

Agreed, but let's not forget that Carnap started his AI company last year with the express goal of reaching AGI comparable to a 'retarded toddler' by 2030. Relatively simple generative AIs have come far, far further than anyone really anticipated, and it is quite unclear if the last 20% to avg-human-level AGI will be much harder, impossible, or also suddenly be solved. I mean, hell, GPT4's context space is still relatively small, it doesn't have a memory, and it is still producing quite impressive results in simple reasoning tasks.
Re: your last sentence, I'm fairly certain this would fall under the category of neuro-symbolic AI. It, too, seems to me like the logical next step.

https://en.m.wikipedia.org/wiki/Neuro-symbolic_AI

> this is not a system that is designed to reason

For what it's worth, neither are we, really. Not disagreeing with anything you're saying, just musing.

> superhuman levels of reasoning

This one has always stumped me a bit though. I'm not quite sure what that looks like. Laplace's Demon?

The goal posts for AI are moving quickly, and in my mind, a lot of the criticism is too shallow.

People want it to perform better than any expert human at any possible subject before it's considered "real AI". It isn't enough for critics for it to be better than the average person at virtually everything it's put to the test on.

It seems like there is some resentment and almost anger at this technology, particularly with the artistic AIs like Midjourney. I can understand that more readily, but what's the real beef with ChatGPT?

> It seems like there is some resentment and almost anger at this technology, particularly with the artistic AIs like Midjourney. I can understand that more readily, but what's the real beef with ChatGPT?

People seem to have a real tough time accepting that human brains might not be that special. They see things like GPT-4, and tend to fall into soothing mental traps to rationalize that innate but baseless rejection. I actually view all the sustained anger and resentment as a signal that we are making meaningful inroads into AGI, as it means that people are actually being impacted.

One of the most common mental traps is "It's just fancy autocomplete." People tend to stop there and don't proceed to consider that the veracity of that claim is irrelevant. Autocomplete or not, GPT-4 seems to be able to provide meaningful assistance to certain workflows that were previously only within the bounds of human cognition.

> People want it to perform better than any expert human at any possible subject before it's considered "real AI". It isn't enough for critics for it to be better than the average person at virtually everything it's put to the test on.

It's quite amusing that some people have moved their goalposts to "well it's not a superintelligence, therefore it's worthless". Simultaneously, it's highly depressing, because it means various actors will likely achieve AGI while the rest of us are still bickering about autocomplete and Chinese rooms.

What we are seeing is the inevitable backlash against a program that at first glance can do literally everything you ask it to in plain English.

We don't exactly know what it can and can't do, a property which in a computer program at any rate is deeply mysterious and unusual. It initially gives the appearance of being a human which knows everything. This leads a lot of people to angrily declare that its appearance is deceptive, and in searching for words to describe in exactly what way it falls short, they incorporate flawed intuition on what it is capable of. So there's a lot of back and forth right now as we collectively swap memes to try and make sense of such a dramatic development.

I think ChatGPT's user interface is particularly suited for confusion and debate about that. We've called obviously-more-specifically-purpose-built things "AI" or "enhanced with AI" or such for years, somewhat interchangeably with other terms like "deep learning" or "machine learning." There's that old saw about "it's AI until it works and then it's just computer science" or somesuch.

And many of those things are worse at their task than a person except for speed and scaling. Can a machine be fooled by dazzle for recognizing a face in a way a human can't? Sure, but nobody is willing to pay for a human to go through everyone's photo albums...

But does ChatGPT "use AI" as a tool in the same sense that Spotify's recommendations "use AI" or is it "an AI" in the sense that it's an independent consciousness/agent?

This is the first time so many people have disagreed on that part. And that skews the debate into "a person is better" vs talking about if a person is even practical in most of the situations we'd use this.

You're framing this as if there were a single yes-or-no question that we should all agree on. (Are the LLMs "any good?")

But in real-world contexts, there are some tasks that just about anyone could do, others where "average" human performance isn't good enough and you need to hire an expert, and also some jobs that can only be done by machine.

So it seems like the bar should be set based on what you think is necessary for whatever practical application you have in mind?

If it's just a game, beating an average chess player, someone who is really good, or the best in the world are different milestones. And for chess there is the Elo rating system that lets you answer this more precisely, too.
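
To make the chess point concrete, here is a minimal sketch of the standard Elo expected-score formula. The function names and the K-factor of 32 are my own illustrative choices, not anything from the paper; the 400-point scale is the conventional one.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (between 0 and 1) for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def updated_rating(rating: float, expected: float, actual: float, k: float = 32) -> float:
    """New rating after one game; actual is 1 for a win, 0.5 draw, 0 loss.

    k=32 is an assumed K-factor for illustration; federations use various values.
    """
    return rating + k * (actual - expected)


if __name__ == "__main__":
    # Equal ratings give an even expectation.
    print(expected_score(1500, 1500))
    # A 400-point gap means the stronger player is expected to score about 0.91.
    print(round(expected_score(1900, 1500), 3))
```

The useful property is that "how good is good enough" becomes a number you can pick per application, rather than a single pass/fail line.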

A paper about how well chatbots do on some reasoning tests can't answer this for you.

Not a simple 'yes-or-no' question, but more about the framing and where the benchmark is.

When they conclude that GPT4 "does not perform astonishingly well" - what is this compared to?

They never define what 'doing well' looks like, were not able to identify an application that does better than GPT4, and also were not able to say what a human benchmark would be if given the same task.

I can say though that I read the sample question and got it wrong too, so these aren't trivial questions we are giving GPT4.

So based on this, I just don't really understand how they can support their conclusion that it "does not perform astonishingly well".

You’re right that they don’t compare to people at all, and the benchmarks don’t show performance on a practical application. And I agree that the last sentence isn’t great, but I don’t think it’s that important. I guess they were hoping it would do better on the benchmarks? It’s not an objective statement.

You don’t read a paper for its conclusion. A good question to ask about a scientific paper is “what did they actually do?” In this case, they asked ChatGPT (presumably GPT3.5) and GPT4 a bunch of logical reasoning questions from some benchmarks and compared the benchmark scores to RoBERTa. That’s it. Running benchmarks can be useful, but how much you care about the benchmarks is up to you.

Higher scores are better, so it does seem promising that GPT4 got more questions right. The scores aren’t that meaningful to me, but it seems like it’s objective confirmation that GPT4 is better than previous systems on logical reasoning?

Maybe the benchmark scores are more meaningful to someone else? What else have these benchmarks been used for?

I think we are probably just evaluating the paper on different metrics too :)

I think my view is just that if your paper is called "Evaluating the Logical Reasoning Ability of GPT-4" and your conclusion is "logical reasoning remains challenging for GPT4" then you should have something in your paper to back up that statement that's more objective, particularly if the findings appear to be that it performs better at logical reasoning than anything else the paper identifies to date.

It's supposed to be an academic paper, not a tumblr post.

How do you make an objective statement about how well GPT-4 does logical reasoning?

Running benchmarks seems like a reasonable way to do it. The objective statements are the benchmark results. They are there. That's the main result of the paper.

You can make objective statements by benchmarking, but by the nature of benchmarking you need something to benchmark lower to be able to conclude that something is performing poorly.

Benchmarking is comparative - that’s the whole point - so the conclusions aren’t actually backed up by the paper.
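
As a rough sketch of what I mean by "comparative": a score only says something once it sits next to reference points such as random guessing or a prior system. Everything here is hypothetical (the function names, the toy answers, the four-option format) and none of it comes from the paper.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing on multiple choice."""
    return 1.0 / num_options


def margin_over(score, baselines):
    """Difference between a score and each named baseline."""
    return {name: score - value for name, value in baselines.items()}


if __name__ == "__main__":
    # Toy data, made up purely for illustration.
    answer_key = ["A", "B", "C", "A"]
    model_predictions = ["A", "B", "C", "D"]

    score = accuracy(model_predictions, answer_key)
    baselines = {"random guessing": chance_accuracy(4)}
    print(score, margin_over(score, baselines))
```

A human-average entry would just be another key in `baselines`, which is exactly the number the paper doesn't give us.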

I think a lot of these LLM benchmarks should include a human avg, otherwise I don't really have a frame of reference other than personal experience with the models.

Human average can be a misleading statistic, because the average human is useless for almost everything. In almost every job, the average person doing the job is well above the average (in the general population) for that particular job.

I think this just demonstrates how the goalposts are shifting though.

Until pretty recently most people would probably say “the average human is very flexible at solving reasoning tasks compared to machines which find reasoning incredibly challenging”.

Now it’s “well of course this AI which wasn’t specifically trained for verbal reasoning can beat an average human at verbal reasoning - humans are useless at almost everything!”

Your goalpost seems to be that GPT needs to be better than experts in their field to be considered “good” at something - but I think it’s just interesting to reflect that that’s the benchmark we are applying now.