by skybrian
1171 days ago
You’re right that they don’t compare to people at all, and the benchmarks don’t show performance on a practical application.

And I agree that the last sentence isn’t great, but I don’t think it’s that important. I guess they were hoping it would do better on the benchmarks? It’s not an objective statement. You don’t read a paper for its conclusion.

A good question to ask about a scientific paper is “what did they actually do?” In this case, they asked ChatGPT (presumably GPT3.5) and GPT4 a bunch of logical reasoning questions from some benchmarks and compared the benchmark scores to RoBERTa. That’s it.

Running benchmarks can be useful, but how much you care about the benchmarks is up to you. Higher scores are better, so it does seem promising that GPT4 got more questions right. The scores aren’t that meaningful to me, but it seems like it’s objective confirmation that GPT4 is better than previous systems on logical reasoning?

Maybe the benchmark scores are more meaningful to someone else? What else have these benchmarks been used for?
|
I think my view is just that if your paper is called "Evaluating the Logical Reasoning Ability of GPT-4" and your conclusion is "logical reasoning remains challenging for GPT4" then you should have something in your paper to back up that statement that's more objective, particularly if the findings appear to be that it performs better at logical reasoning than anything else the paper identifies to date.
It's supposed to be an academic paper, not a Tumblr post.