by rushingcreek
1030 days ago
There are a couple of things here:

1. I'm not saying we have to wait until GPT-5; we just need an apples-to-apples comparison where contamination is taken into account.
2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could have come from.
3. Anecdotally, I've noticed degradation in the GPT-4 June update vs. the original March release.

Once Markdown formatting is accounted for, the June model's score on the LeetCode questions from the LLM Drift paper's testing rises to 70% (35/50), vs. the March model's 52% (26/50).
See:
* https://github.com/lchen001/LLMDrift/blob/main/generation/
* https://twitter.com/Si_Boehm/status/1681801371656536068
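
To make "accounting for Markdown formatting" concrete, here is a rough sketch of the kind of preprocessing that would be involved: pull the code out of a fenced block before running it, then compute the pass rate. This is not the script from the paper or from the linked tweet; `strip_markdown_fences` and `pass_rate` are hypothetical helper names, and the regex assumes the response uses ordinary triple-backtick fences.

```python
import re

# Assumption: the model wraps its answer in a standard triple-backtick fence,
# optionally followed by a language tag such as "python".
FENCE_RE = re.compile(r"```[a-zA-Z0-9_+-]*[ \t]*\n(.*?)```", re.DOTALL)


def strip_markdown_fences(response: str) -> str:
    """Return the code inside the first fenced block, or the response unchanged."""
    match = FENCE_RE.search(response)
    if match:
        return match.group(1).strip("\n")
    return response.strip("\n")


def pass_rate(passed: int, total: int) -> float:
    """Percentage of questions answered correctly."""
    return 100.0 * passed / total


if __name__ == "__main__":
    fenced = "```python\nprint(1)\n```"
    print(strip_markdown_fences(fenced))
    # Counts quoted in the comment above: 35/50 for June, 26/50 for March.
    print(pass_rate(35, 50), pass_rate(26, 50))
```

Running the raw response directly would fail on the fence line itself, which is how a formatting change can look like a drop in code quality.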