Right, but there are no contamination studies there. I suspect that RLHF data leaked HumanEval into GPT-4.
It just seems unlikely to me that GPT-4's coding abilities have improved since March (when 67% was officially reported by OpenAI) given all of the examples and anecdotes about degradation.
I have several arguments for why contamination is probably not the main reason for the performance difference.
When we worked on StarCoder, people ran gpt-4 on MultiPL-E, which doesn't have canonical solutions on the internet, and the performance was higher than what you would expect from the official numbers.
The official contamination analysis shows only a minor drop in performance even though contamination is fairly high (you may argue that contamination is higher now or that RLHF has a stronger effect).
There is a significant drop in performance when testing on HumanEval+ [1], which shouldn't happen if the model had memorized the canonical solutions.
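For anyone who wants to check contamination on their own data, here is a minimal sketch of a substring-overlap test. It is a generic illustration, not necessarily the method behind the official analysis; the function names, the probe count, and the 50-character probe length are all assumptions.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase the text and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def is_contaminated(example: str, corpus: str, n_probes: int = 3,
                    probe_len: int = 50, seed: int = 0) -> bool:
    """Return True if a sampled substring of the example occurs in the corpus.

    Hypothetical helper: it samples a few fixed-length substrings from the
    benchmark example and looks for an exact match in the training text.
    """
    ex, corp = normalize(example), normalize(corpus)
    if not ex:
        return False
    if len(ex) <= probe_len:
        # Example too short to sample from, so compare it as a whole.
        return ex in corp
    rng = random.Random(seed)
    for _ in range(n_probes):
        start = rng.randrange(len(ex) - probe_len + 1)
        if ex[start:start + probe_len] in corp:
            return True
    return False
```

Exact matching like this only catches verbatim leaks. Paraphrased or reformatted solutions would slip through, which is one reason a held-out variant such as HumanEval+ is a useful second check.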
The "intelligence" of large language models needs to be evaluated like the abilities of self-proclaimed psychics. You send your binary to an independent third party, who evaluates it on new problems. It's only a "Human eval" once.
>> given all of the examples and anecdotes about degradation.
How many examples and anecdotes about degradation are actually scientific side-by-side studies? I see absurd articles online about ChatGPT usage going down the drain by kids, completely failing to consider even the most basic fact of seasonality and how school is out for the summer!
It takes like 2-3 experiences of receiving a confidently wrong answer to downgrade your usage. If you use a refactoring tool to rename and it misses one, you won’t use it again.
While that would likely be my experience with a refactoring tool (unless I didn't have a better alternative), that's not my experience with ChatGPT 4.
And that's considering I have very little tolerance for buggy software.
There was a period of a few weeks or months in which it seemed like ChatGPT had really degraded to the point of being unusable (although it could have been my biases). However, it seems to be better now (again, my subjective experience).
Sometimes I still catch it making really basic mistakes, but most times I can convince it to correct them (especially if I point them out).
But what's most amazing to me is how ChatGPT is absolutely brilliant at some things, and not just technical or even obscure topics.
Recently, it gave me the most amazing idea for navigating a complex and nuanced social situation I was having difficulty with. And given the constraints of the situation, there was no way I could have gotten that idea otherwise, especially in the allotted time.
So despite its flaws and mistakes, I still find it to be a tremendously useful tool, even if only to point me in the right direction.
Given the fact that OpenAI has constant resources (for any given small span of time) and varying demand (users and query type), it's not crazy to think they dynamically adjust to consume all available resources on their side.
Obviously the base model would be the same, but aren't there +/- flavors they could overlay with extra compute? E.g. multi-pass, additional experts, etc.
The benefits to giving someone an occasional "magic" answer are too great not to.
Have there been any wide studies on same-prompt-different-times?
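A same-prompt-different-times check is easy to sketch yourself. The snippet below is only an illustration: `ask` is a hypothetical callable that you would wire up to whatever model client you use, and exact-match agreement is an assumed (and fairly crude) consistency metric.

```python
from collections import Counter
from typing import Callable, List


def sample_answers(ask: Callable[[str], str], prompt: str, runs: int) -> List[str]:
    """Send the same prompt `runs` times and collect the stripped answers."""
    return [ask(prompt).strip() for _ in range(runs)]


def agreement_rate(answers: List[str]) -> float:
    """Fraction of answers that equal the most common answer."""
    if not answers:
        return 0.0
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)
```

Logging the answers with a timestamp and repeating the run at different hours or on different days would show whether agreement moves with load, at least for prompts that have one short correct answer.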
> So despite its flaws and mistakes, I still find it to be a tremendously useful tool, even if only to point me in the right direction.
Much of this resonates. That said, I get tremendous value simply by writing things down (or dictating them) and replying to my own question. I would expect that a sizable fraction of people have forgotten about these strategies and/or don't use them when they are most useful. For many, there is tremendous muscle memory to run a Hooli search almost on mental autopilot. Who has time to slow down and write a well-conceived question? Or perhaps we should turn it around ... On a longer time horizon, who would want to waste time with poorly-conceived questions?
It is the question that starts the process. So we should ask good questions. Do we? I'd be curious about the usage data OpenAI collects. I do my best to lower expectations about people in general, but I'm confident I'd still be unprepared for the level of thought put into questions.
> But what's most amazing to me is how ChatGPT is absolutely brilliant at some things, and not just technical or even obscure topics.
I'm not amazed in the way you are. I expect a variation in quality across topics and domains and question styles.
> I'm not amazed in the way you are. I expect a variation in quality across topics and domains and question styles.
Yes, I can see that. But over time, you also learn and adapt the prompts to ChatGPT's peculiarities so that it provides more useful output.
Still, I'm sure there are many topics/domains for which it's not useful.
As another anecdote, I'm not a mathematician but at one point I was playing around with proving theorems on a theorem prover.
What I found is that ChatGPT is this paradoxical entity which makes the most elementary math errors all the time (I'm talking third-grade level math mistakes), and yet, it was by far the most useful tool ever in coming up with lots of useful PhD-level ideas and math theorems that would allow me to complete proofs when I was completely stuck (and not just for proofs which it had seen before).
It came up all the time with brilliant ideas and theorems which simultaneously I didn't even know existed, were not part of any theorem database of any theorem prover I had seen before (and I've seen the vast majority of them), and there was no way I was going to find them by searching on the web or writing things down on a notepad (I know this because I had tried, for days at a time, along with other ideas such as visualizations and simulations).
That's not to say a mathematician wouldn't be aware of them, but I don't have easy access to one, and I was surely not going to pay one given that I was just exploring, mostly for curiosity.
This seems like a paid ad, but I promise you, I have no affiliation whatsoever...
When I was doing a lot of C++ gamedev, we were definitely doing a lot of stuff that would trip up static analysis, e.g. X-macros.
We would still use refactoring tools even though they would often miss stuff. You just rely on a combination of refactoring tool / search and replace / the compiler.
We would also debug our code in release mode with symbols. You get used to a debugging environment where you don't trust anything you're seeing in variables, etc. too.
Of course, I'd like to see more than one study. But this one is by a well known university, and it's pretty conclusive. GPT-4 is getting worse (especially for code, maths, and analytical reasoning) and more censored.
It's important to frame this correctly.
The article is a bit misguided (it doesn't matter which university publishes an article) because there are so many ways in which a model can be altered, even excluding retraining weights. Also, even if performance has dropped in practice because resources were removed and more shortcuts are taken (for example, changed beam search and typical sampling parameters), it's not really appropriate to draw implications about the future outlook, since retraining weights, changing the architecture, etc. can improve capabilities immensely.
It's important not to suggest that GPT systems in general are on the way out simply due to some small alterations in parameters that make a system slightly less performant (which seems to be a popular perspective).
Isn't this the study that asked a bunch of questions with the same answer ("yes") and basically the old model always answered "yes" and the new model always answered "no"? That's not a degradation in performance. It was never answering the questions in the first place, just guessing. The only thing that changed was the default guess.
You are wrong. What I described is exactly what they did in the "math" benchmark of v1 of the study (https://arxiv.org/pdf/2307.09009v1.pdf). They asked "is this number prime" for a bunch of prime numbers. The old version gave a bunch of "reasoning" that was actually faulty and then guessed "yes". The new version guessed "no" (which is arguably a better guess as to whether a random number is prime). In neither case did it actually do the work required to answer the question and the change in "correct" answers is an illusion.
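To make the point concrete, here is a small sketch of how a constant guess scores on a primes-only test set versus a mixed one. The number range and helper names are assumptions chosen for illustration, not the study's actual data.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check, fine for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(guess: bool, numbers) -> float:
    """Accuracy of a 'model' that answers `guess` for every number."""
    labels = [is_prime(n) for n in numbers]
    return sum(label == guess for label in labels) / len(labels)


# Hypothetical test sets: one with primes only, one with every integer in range.
primes_only = [n for n in range(1000, 2000) if is_prime(n)]
mixed = list(range(1000, 2000))

always_yes_on_primes = accuracy(True, primes_only)
always_no_on_primes = accuracy(False, primes_only)
always_yes_on_mixed = accuracy(True, mixed)
always_no_on_mixed = accuracy(False, mixed)
```

On the primes-only set the "always yes" guesser is perfect and the "always no" guesser scores zero, while on the mixed set the ranking flips. Neither does any work, so the benchmark measures the default guess rather than ability.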
In the programming category the newer GPT-4 actually performed significantly better but started formatting code with backticks that the study's evaluation code didn't handle properly, so they falsely concluded that it was worse. https://twitter.com/Si_Boehm/status/1681801371656536068
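The formatting issue is easy to reproduce. Below is a minimal sketch of stripping a Markdown code fence before checking a response; it is an assumed reconstruction of the kind of fix involved, not the study's actual evaluation code, and all names are hypothetical.

```python
import re

# Markdown fence delimiter, built programmatically to avoid a literal fence here.
FENCE = "`" * 3
# Opening fence with an optional language tag, then the body, then a closing fence.
FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the response unchanged."""
    match = FENCE_RE.search(response)
    return match.group(1) if match else response


def compiles(source: str) -> bool:
    """Check only that the source is syntactically valid Python."""
    try:
        compile(source, "<response>", "exec")
        return True
    except SyntaxError:
        return False
```

Feeding a fenced response straight into `compiles` fails on the backticks even when the code inside is correct, while `compiles(extract_code(response))` passes. That is how a formatting change can show up as a drop in "directly executable" code.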
They later submitted a revision to the study attempting to correct these blatant issues but trusting their work is clearly a terrible idea. The study was executed very poorly and should be ignored with extreme prejudice.
At this point you've gotten like 3 or 4 replies explaining how at best you're drawing a flawed conclusion from the paper, and at worst it's a flawed paper in itself.
Funnily enough, just skimming through it again I found yet another glaring mistake they made: they left the system prompt empty for both checkpoints, yet the headline feature of the new checkpoint was improved steerability via the system prompt: https://openai.com/blog/function-calling-and-other-api-updat...
Every time I look at this paper my inclination drifts further away from harmless incompetence. Matei Zaharia is the CTO of Databricks; it feels like too perfect a coincidence that someone who built a career on ML and study would suddenly drop the ball right as their company is trying to pivot to on-premise MLOps, whose prime competition is ChatGPT...
On most of their tests gpt-4 is not actually worse [1]. In particular, the coding results are affected by a changed output format rather than by worse abilities [2]. But that's ok because the message of the paper is that there is strong drift between versions and developers should be aware of it, not that gpt becomes worse [3].
Also remember bad research can come out of good universities. Remember the "gzip compressor beats BERT" paper that showed gzip beat BERT at many KNN-based tasks? Or just Google for Wansink Cornell.
So best to treat every paper like an i.i.d. sample and judge each on its own merits.
Quality has degraded but token generation speed has increased. The GPT4 of today isn’t the same as it used to be. To get the old slow model you need to use GPT4-0314, which gives higher quality answers like it used to.
But the model in OP is fine-tuned by "a proprietary dataset of ~80k high-quality programming problems and solutions". How do we know it's not contaminated by HumanEval too?
(Chat)GPT-4's practical coding abilities are now 100x because it can code, run the code, and reason about its performance mid-response. They must be using fine-tunes for this, so the overall model could well be better too.
"model" is end-to-end, input-to-output, inclusive of the entire framework and its guardrails and everything else
if they are able to detect hallucinations, filter them out and automatically re-run, that's a huge improvement in result, even though the core model didn't get new training
There weren't any serious examples of degradation.
Does only GPT-4 have to suffer a penalty for HumanEval leaking into training data/RLHF data?
Ignoring those concerns, it fails a reasonableness smell test:
We'd have to pretend it's the original GPT-4 release from March 2023 until GPT-5 comes out, and only then can OpenAI's work be compared to LLAMA-2 to LLAMA-N.
> 2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from
Once Markdown formatting is accounted for, the June model improves answers on the Leetcode questions from the LLM Drift paper testing to 70% (35/50) vs the March model's 52% (26/50).
1. TL;DR: OpenAI must verify HumanEval data wasn't used in training in order to compare it?
2. Link in the post you replied to.
3. Subjectivity is fine by me! There's a motte & bailey flavor to it if we combine your comment and this one, cf. "This is why we use the official numbers."
Also for a topic like this, subjectivity is all there really is. Even if you create some metric, what you prioritize is going to be subjective. Because performance is going to vary against different sorts of tasks, and there are a literally infinite number of categories of tasks, so it's not like you can ever truly get a fair sampling.
Because of this, a sample of subjective opinions is probably much more valuable than any official metric, especially if that metric comes from, as you mentioned, individuals/orgs who are highly motivated to game it endlessly. Even when it comes from an external source you end up with a similar risk of it being gamed. It's like how old school Google puzzle interviews went from seeing who was most clever [in that domain], to seeing who'd booked up the most.
> It just seems unlikely to me that GPT-4's coding abilities have improved since March (when 67% was officially reported by OpenAI) given all of the examples and anecdotes about degradation.
This is why we use the official numbers.