| HN Mirror

You are wrong. What I described is exactly what they did in the "math" benchmark of v1 of the study (https://arxiv.org/pdf/2307.09009v1.pdf). They asked "is this number prime" for a bunch of prime numbers. The old version gave a bunch of "reasoning" that was actually faulty and then guessed "yes". The new version guessed "no" (which is arguably a better guess as to whether a random number is prime). In neither case did it actually do the work required to answer the question and the change in "correct" answers is an illusion.

In the programming category the newer GPT-4 actually performed significantly better but started formatting code with backticks that the study's evaluation code didn't handle properly, so they falsely concluded that it was worse. https://twitter.com/Si_Boehm/status/1681801371656536068

They later submitted a revision to the study attempting to correct these blatant issues but trusting their work is clearly a terrible idea. The study was executed very poorly and should be ignored with extreme prejudice.