


	by modeless 1034 days ago
	Isn't this the study that asked a bunch of questions with the same answer ("yes") and basically the old model always answered "yes" and the new model always answered "no"? That's not a degradation in performance. It was never answering the questions in the first place, just guessing. The only thing that changed was the default guess.


esperent 1034 days ago

No, it's not. Here's a link to the study; you can check the questions they asked.

https://arxiv.org/pdf/2307.09009.pdf


modeless 1034 days ago

You are wrong. What I described is exactly what they did in the "math" benchmark of v1 of the study (https://arxiv.org/pdf/2307.09009v1.pdf). They asked "is this number prime" for a bunch of prime numbers. The old version gave a bunch of "reasoning" that was actually faulty and then guessed "yes". The new version guessed "no" (which is arguably a better guess as to whether a random number is prime). In neither case did it actually do the work required to answer the question and the change in "correct" answers is an illusion.
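
To make the point concrete, here is a minimal sketch. The two "models" are hypothetical stand-ins that always give the same answer, and the number range is an arbitrary choice for illustration, not the study's data. On a test set containing only primes, a constant "yes" scores 100% and a constant "no" scores 0%, even though neither does any work. On a balanced set the two are indistinguishable.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for the small numbers used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


# Hypothetical stand-ins for the two checkpoints: each ignores its input.
def always_yes(n: int) -> bool:
    return True


def always_no(n: int) -> bool:
    return False


def accuracy(model, numbers) -> float:
    """Fraction of numbers for which the model's answer matches the truth."""
    correct = sum(model(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


# Illustrative test sets (assumption: any range with enough primes would do).
primes_only = [n for n in range(1000, 2000) if is_prime(n)]
composites = [n for n in range(1000, 2000) if not is_prime(n)][: len(primes_only)]
balanced = primes_only + composites

print("primes only:", accuracy(always_yes, primes_only), accuracy(always_no, primes_only))
print("balanced:   ", accuracy(always_yes, balanced), accuracy(always_no, balanced))
```

The swing from 100% to 0% on the primes-only set comes entirely from the test set's composition, which is why a balanced set is needed before calling the change a regression.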

In the programming category the newer GPT-4 actually performed significantly better but started formatting code with backticks that the study's evaluation code didn't handle properly, so they falsely concluded that it was worse. https://twitter.com/Si_Boehm/status/1681801371656536068
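
A rough sketch of the kind of handling that was missing, assuming the evaluation simply checked whether the raw response was valid code. The study's actual evaluation code is not shown here, so `extract_code` and `is_valid_python` are hypothetical helpers, not theirs. A response wrapped in a Markdown fence fails a naive check but passes once the fence is stripped.

```python
import re

# Matches a Markdown code fence (three backticks, optional language tag)
# and captures the code inside it.
_FENCE_RE = re.compile(r"`{3}[\w+-]*[ \t]*\n(.*?)`{3}", re.DOTALL)


def extract_code(response: str) -> str:
    """Return the contents of the first fenced block, or the response unchanged."""
    match = _FENCE_RE.search(response)
    return match.group(1) if match else response


def is_valid_python(source: str) -> bool:
    """Check only that the source parses; it is never executed."""
    try:
        compile(source, "<response>", "exec")
        return True
    except SyntaxError:
        return False


fence = "`" * 3
raw_response = f"{fence}python\nprint('hello')\n{fence}"

print(is_valid_python(raw_response))                # naive check on the raw text
print(is_valid_python(extract_code(raw_response)))  # check after stripping the fence
```

The code inside the fence is identical in both checks; only the wrapper differs, so counting the first result as a failure measures formatting rather than code quality.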

They later submitted a revision to the study attempting to correct these blatant issues but trusting their work is clearly a terrible idea. The study was executed very poorly and should be ignored with extreme prejudice.


BoorishBears 1034 days ago

At this point you've gotten like 3 or 4 replies explaining how at best you're drawing a flawed conclusion from the paper, and at worst it's a flawed paper in itself.

Funnily enough, just skimming through it again I found yet another glaring mistake they made: they left the system prompt empty for both checkpoints, yet the headline feature of the new checkpoint was improved steerability via the system prompt: https://openai.com/blog/function-calling-and-other-api-updat...

Every time I look at this paper my inclination drifts further away from harmless incompetence. Matei Zaharia is the CTO of Databricks, it feels like too perfect of a coincidence that someone who built a career on ML and study would suddenly drop the ball right as their company is trying to pivot to on-premise MLOps, whose prime competition is ChatGPT...
