by Closi
1171 days ago
Not a simple 'yes-or-no' question, but more about the framing and where the benchmark is. When they conclude that GPT4 "does not perform astonishingly well" - what is this compared to? They never define what 'doing well' looks like, were not able to identify an application that does better than GPT4, and also were not able to say what a human benchmark would be if given the same task. I can say though that I read the sample question and got it wrong too, so these aren't trivial questions we are giving GPT4. So based on this, I just don't really understand how they can support their conclusion that it "does not perform astonishingly well".
|
You don’t read a paper for its conclusion. A good question to ask about a scientific paper is “what did they actually do?” In this case, they asked ChatGPT (presumably GPT3.5) and GPT4 a bunch of logical reasoning questions from some benchmarks and compared the benchmark scores to RoBERTa. That’s it. Running benchmarks can be useful, but how much you care about the benchmarks is up to you.
Higher scores are better, so it does seem promising that GPT4 got more questions right. The scores aren’t that meaningful to me, but it seems like it’s objective confirmation that GPT4 is better than previous systems on logical reasoning?
Maybe the benchmark scores are more meaningful to someone else? What else have these benchmarks been used for?