| HN Mirror

by skybrian 1171 days ago

You're framing this as if there were a single yes-or-no question that we should all agree on. (Are the LLMs "any good?")

But in real-world contexts, there are some tasks that just about anyone could do, others where "average" human performance isn't good enough and you need to hire an expert, and also some jobs that can only be done by machine.

So it seems like the bar should be set based on what you think is necessary for whatever practical application you have in mind?

If it's just a game, beating an average chess player, someone who is really good, or the best in the world are different milestones. And for chess there is the Elo rating system, which lets you answer this more precisely, too.

A paper about how well chatbots do on some reasoning tests can't answer this for you.

1 comment

Closi 1171 days ago

Not a simple 'yes-or-no' question, but more about the framing and where the benchmark is.

When they conclude that GPT4 "does not perform astonishingly well" - what is this compared to?

They never define what 'doing well' looks like, were not able to identify an application that does better than GPT4, and also were not able to say what a human benchmark would be if given the same task.

I can say though that I read the sample question and got it wrong too, so these aren't trivial questions we are giving GPT4.

So based on this, I just don't really understand how they can support their conclusion that it "does not perform astonishingly well".

skybrian 1171 days ago

You’re right that they don’t compare to people at all, and the benchmarks don’t show performance on a practical application. And I agree that the last sentence isn’t great, but I don’t think it’s that important. I guess they were hoping it would do better on the benchmarks? It’s not an objective statement.

You don’t read a paper for its conclusion. A good question to ask about a scientific paper is “what did they actually do?” In this case, they asked ChatGPT (presumably GPT3.5) and GPT4 a bunch of logical reasoning questions from some benchmarks and compared the benchmark scores to RoBERTa. That’s it. Running benchmarks can be useful, but how much you care about the benchmarks is up to you.

Higher scores are better, so it does seem promising that GPT4 got more questions right. The scores aren’t that meaningful to me, but it seems like it’s objective confirmation that GPT4 is better than previous systems on logical reasoning?

Maybe the benchmark scores are more meaningful to someone else? What else have these benchmarks been used for?

Closi 1171 days ago

I think we are probably just evaluating the paper on different metrics too :)

I think my view is just that if your paper is called "Evaluating the Logical Reasoning Ability of GPT-4" and your conclusion is "logical reasoning remains challenging for GPT4" then you should have something in your paper to back up that statement that's more objective, particularly if the findings appear to be that it performs better at logical reasoning than anything else the paper identifies to date.

It's supposed to be an academic paper, not a tumblr post.

skybrian 1170 days ago

How do you make an objective statement about how well GPT-4 does logical reasoning?

Running benchmarks seems like a reasonable way to do it. The objective statements are the benchmark results. They are there. That's the main result of the paper.

Closi 1169 days ago

You can make objective statements by benchmarking, but by the nature of benchmarking you need something that it scores lower than before you can conclude that it is performing poorly.

Benchmarking is comparative - that’s the whole point - so the conclusions aren’t actually backed up by the paper.
