You are wrong. What I described is exactly what they did in the "math" benchmark of v1 of the study (https://arxiv.org/pdf/2307.09009v1.pdf). They asked "is this number prime" for a bunch of prime numbers. The old version gave a bunch of "reasoning" that was actually faulty and then guessed "yes". The new version guessed "no" (which is arguably a better guess as to whether a random number is prime). In neither case did it actually do the work required to answer the question and the change in "correct" answers is an illusion.
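To make the "illusion" concrete, here is a minimal sketch of how two constant guessers score on an all-prime test set versus a balanced one. The number range and the helper names (`is_prime`, `accuracy`) are my own stand-ins, not the study's actual data or code.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for the small numbers used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(guess: bool, numbers: list[int]) -> float:
    """Score a 'model' that gives the same answer for every number."""
    correct = sum(1 for n in numbers if is_prime(n) == guess)
    return correct / len(numbers)


# Hypothetical stand-in for the benchmark: every number asked about is prime.
all_primes = [n for n in range(1000, 2000) if is_prime(n)]

# A balanced set: the same primes plus an equal number of composites.
composites = [n for n in range(1000, 2000) if not is_prime(n)][: len(all_primes)]
balanced = all_primes + composites

always_yes_on_primes = accuracy(True, all_primes)
always_no_on_primes = accuracy(False, all_primes)
always_yes_on_balanced = accuracy(True, balanced)
always_no_on_balanced = accuracy(False, balanced)

print(always_yes_on_primes, always_no_on_primes)      # 1.0 0.0
print(always_yes_on_balanced, always_no_on_balanced)  # 0.5 0.5
```

On the all-prime set, flipping the default guess swings the score from 100% to 0% without any change in actual ability. On the balanced set both guessers sit at 50%.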
In the programming category the newer GPT-4 actually performed significantly better but started formatting code with backticks that the study's evaluation code didn't handle properly, so they falsely concluded that it was worse. https://twitter.com/Si_Boehm/status/1681801371656536068
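A rough sketch of that failure mode, assuming the check amounts to "does the raw output run as-is" (I'm not reproducing the study's actual evaluation code; `strip_code_fence` and `is_directly_executable` are hypothetical helpers):

```python
import re

# Built with multiplication so this example contains no literal fence.
FENCE = "`" * 3


def strip_code_fence(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    pattern = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else text


def is_directly_executable(source: str) -> bool:
    """Check only whether the text parses as Python; it is never run."""
    try:
        compile(source, "<model-output>", "exec")
        return True
    except SyntaxError:
        return False


# Model output wrapped in a markdown fence, as the newer checkpoint did.
raw_output = FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE

naive_result = is_directly_executable(raw_output)
fair_result = is_directly_executable(strip_code_fence(raw_output))

print(naive_result)  # False: the backticks are a syntax error
print(fair_result)   # True: the code inside the fence is fine
```

The code is identical in both cases; only the wrapper differs, so a naive check punishes formatting rather than correctness.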
They later submitted a revision to the study attempting to correct these blatant issues, but trusting their work is clearly a terrible idea. The study was executed very poorly and should be ignored with extreme prejudice.
At this point you've gotten like 3 or 4 replies explaining how at best you're drawing a flawed conclusion from the paper, and at worst it's a flawed paper in itself.
Funnily enough, just skimming through it again I found yet another glaring mistake they made: they left the system prompt empty for both checkpoints, yet the headline feature of the new checkpoint was improved steerability via the system prompt: https://openai.com/blog/function-calling-and-other-api-updat...
Every time I look at this paper my inclination drifts further away from harmless incompetence. Matei Zaharia is the CTO of Databricks, and it feels like too perfect a coincidence that someone who built a career on ML and study would suddenly drop the ball right as their company is trying to pivot to on-premise MLOps, whose prime competition is ChatGPT...