| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by skybrian 1171 days ago

You’re right that they don’t compare to people at all, and the benchmarks don’t show performance on a practical application. And I agree that the last sentence isn’t great, but I don’t think it’s that important. I guess they were hoping it would do better on the benchmarks? It’s not an objective statement.

You don’t read a paper for its conclusion. A good question to ask about a scientific paper is “what did they actually do?” In this case, they asked ChatGPT (presumably GPT3.5) and GPT4 a bunch of logical reasoning questions from some benchmarks and compared the benchmark scores to RoBERTa. That’s it. Running benchmarks can be useful, but how much you care about the benchmarks is up to you.

Higher scores are better, so it does seem promising that GPT4 got more questions right. The scores aren’t that meaningful me, but it seems like it’s objective confirmation that GPT4 is better than previous systems on logical reasoning?

Maybe the benchmark scores are more meaningful to someone else? What else have these benchmarks been used for?

1 comments

Closi 1170 days ago

I think we are probably just evaluating the paper on different metrics too :)

I think my view is just that if your paper is called "Evaluating the Logical Reasoning Ability of GPT-4" and your conclusion is "logical reasoning remains challenging for GPT4" then you should have something in your paper to back up that statement that's more objective, particularly if the findings appear to be that it performs better at logical reasoning than anything else the paper identifies to date.

It's supposed to be an academic paper, not a tumblr post.

link

skybrian 1170 days ago

How do you make an objective statement about how well GPT-4 does logical reasoning?

Running benchmarks seems like a reasonable way to do it. The objective statements are the benchmark results. They are there. That's the main result of the paper.

link

Closi 1169 days ago

You can make objective statements by benchmarking, but by the nature of benchmarking you need something to benchmark lower to be able to conclude that something is performing poorly.

Benchmarking is comparative - that’s the whole point - so the conclusions aren’t actually backed up by the paper.

link