by skybrian
1171 days ago
You're framing this as if there were a single yes-or-no question that we should all agree on (are the LLMs "any good?"). But in real-world contexts, there are some tasks that just about anyone could do, others where "average" human performance isn't good enough and you need to hire an expert, and also some jobs that can only be done by machine. So it seems like the bar should be set based on what you think is necessary for whatever practical application you have in mind? If it's just a game, beating an average chess player, someone who is really good, or the best in the world are different milestones. And for chess there is the Elo rating system, which lets you answer this more precisely, too. A paper about how well chatbots do on some reasoning tests can't answer this for you.
|
When they conclude that GPT4 "does not perform astonishingly well" - what is this compared to?
They never define what 'doing well' looks like, were not able to identify an application that does better than GPT4, and also were not able to say how a human would score if given the same task.
I can say, though, that I read the sample question and got it wrong too, so these aren't trivial questions we are giving GPT4.
So based on this, I just don't really understand how they can support their conclusion that it "does not perform astonishingly well".