|
I would argue almost every popular benchmark quoted by the big LLM companies is tainted. OAI, xAI, Anthropic, and Google all score incredibly well, then you go to try and write code and it's just okay. They claim it can do PhD-level reasoning, but here I am not trusting it on basic computational thinking. |
Not sure that's really the claim. I think they claim that performance on benchmarks like GPQA indicates PhD-level knowledge of different fields.