by esperent 1031 days ago
I'm aware of at least one study by Stanford. PDF paper linked in this article:

https://www.techopedia.com/is-gpt-4-a-flop

Of course, I'd like to see more than one study. But this one is by a well-known university, and it's pretty conclusive. GPT-4 is getting worse (especially for code, maths, and analytical reasoning) and more censored.

4 comments

It's important to frame this correctly. The article is a bit misguided (it doesn't matter which university publishes a paper) because there are many ways a model can be altered, even without retraining the weights. Also, even if performance has dropped in practice because resources were cut and more shortcuts are being taken (for example, changed beam search or typical sampling parameters), it's not really appropriate to draw conclusions about the future outlook, since retraining weights, changing the architecture, etc. can improve capabilities immensely.

It's important not to suggest that GPT systems in general are on the way out simply due to some small alterations in parameters that make a system slightly less performant (which seems to be a popular perspective).

Isn't this the study that asked a bunch of questions with the same answer ("yes") and basically the old model always answered "yes" and the new model always answered "no"? That's not a degradation in performance. It was never answering the questions in the first place, just guessing. The only thing that changed was the default guess.

No it's not. Here's a link to the study, you can check the questions they asked.

https://arxiv.org/pdf/2307.09009.pdf

You are wrong. What I described is exactly what they did in the "math" benchmark of v1 of the study (https://arxiv.org/pdf/2307.09009v1.pdf). They asked "is this number prime" for a bunch of prime numbers. The old version gave a bunch of "reasoning" that was actually faulty and then guessed "yes". The new version guessed "no" (which is arguably a better guess as to whether a random number is prime). In neither case did it actually do the work required to answer the question and the change in "correct" answers is an illusion.
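To make the "illusion" point concrete, here's a minimal sketch (not the study's code or data; the number range and the two constant "models" are my own assumptions for illustration). When every test item has the same correct answer, a model that always guesses "yes" scores 100% and one that always guesses "no" scores 0%, even though neither does any primality checking.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check, used only to build ground-truth labels."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(predict, numbers, labels) -> float:
    """Fraction of numbers for which predict(n) matches the label."""
    correct = sum(predict(n) == label for n, label in zip(numbers, labels))
    return correct / len(numbers)


# Two hypothetical "models" that never look at the input.
def always_yes(n: int) -> bool:
    return True


def always_no(n: int) -> bool:
    return False


# Benchmark A: primes only, so every correct answer is "yes".
primes_only = [n for n in range(2, 200) if is_prime(n)]
primes_labels = [True] * len(primes_only)

# Benchmark B: primes and composites mixed (an assumed, more balanced design).
mixed = list(range(2, 200))
mixed_labels = [is_prime(n) for n in mixed]

acc_yes_primes = accuracy(always_yes, primes_only, primes_labels)  # 1.0
acc_no_primes = accuracy(always_no, primes_only, primes_labels)    # 0.0
acc_yes_mixed = accuracy(always_yes, mixed, mixed_labels)
acc_no_mixed = accuracy(always_no, mixed, mixed_labels)

print("primes only:", acc_yes_primes, acc_no_primes)
print("mixed:      ", acc_yes_mixed, acc_no_mixed)
```

On the mixed set the "always no" guesser comes out ahead, simply because most numbers in that range are composite. So the score swing on a primes-only set tells you about the default guess, not about reasoning ability.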

In the programming category the newer GPT-4 actually performed significantly better but started formatting code with backticks that the study's evaluation code didn't handle properly, so they falsely concluded that it was worse. https://twitter.com/Si_Boehm/status/1681801371656536068
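A rough sketch of how that kind of evaluation bug can happen (this is not the study's actual harness; `extract_code`, `is_directly_executable` and the sample response are hypothetical names and data I made up). If the check feeds the raw response straight to the interpreter, a Markdown code fence around otherwise correct code counts as a failure; stripping the fence first changes the verdict.

```python
import re

# Build the fence marker programmatically so this example does not contain
# a literal run of three backticks.
FENCE_MARK = "`" * 3

# Matches an opening fence with an optional language tag, captures the body,
# and stops at the closing fence.
FENCE_RE = re.compile(FENCE_MARK + r"[\w+-]*\n(.*?)" + FENCE_MARK, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the response unchanged."""
    match = FENCE_RE.search(response)
    return match.group(1) if match else response


def is_directly_executable(source: str) -> bool:
    """Naive check: does the text compile as Python as-is?"""
    try:
        compile(source, "<response>", "exec")
        return True
    except SyntaxError:
        return False


raw_code = "def add(a, b):\n    return a + b\n"
fenced_response = FENCE_MARK + "python\n" + raw_code + FENCE_MARK

naive_verdict = is_directly_executable(fenced_response)
fixed_verdict = is_directly_executable(extract_code(fenced_response))

print("naive check passes:", naive_verdict)
print("after stripping fence:", fixed_verdict)
```

The code inside the fence is identical in both cases; only the handling of the formatting differs.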

They later submitted a revision to the study attempting to correct these blatant issues but trusting their work is clearly a terrible idea. The study was executed very poorly and should be ignored with extreme prejudice.

At this point you've gotten like 3 or 4 replies explaining how at best you're drawing a flawed conclusion from the paper, and at worst it's a flawed paper in itself.

Funnily enough, just skimming through it again I found yet another glaring mistake they made: they left the system prompt empty for both checkpoints, yet the headline feature of the new checkpoint was improved steerability via the system prompt: https://openai.com/blog/function-calling-and-other-api-updat...

Every time I look at this paper my inclination drifts further away from harmless incompetence. Matei Zaharia is the CTO of Databricks; it feels like too perfect of a coincidence that someone who built a career on ML and study would suddenly drop the ball right as their company is trying to pivot to on-premise MLOps, whose prime competition is ChatGPT...

On most of their tests GPT-4 is not actually worse [1]. In particular, the coding results changed because of a different output format rather than worse abilities [2]. But that's ok because the message of the paper is that there is strong drift between versions and developers should be aware of it, not that GPT becomes worse [3].

[1] https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-tim...

[2] https://twitter.com/Si_Boehm/status/1681801371656536068

[3] https://twitter.com/matei_zaharia/status/1681805357516210177

Also remember bad research can come out of good universities. Remember the "gzip beats BERT" paper that showed gzip beating BERT at many kNN-based tasks? Or just Google for Wansink Cornell.

So best to treat every paper like an i.i.d. sample and judge them on their own merits.