Hacker News
by lhl 1030 days ago
> 2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from

Once Markdown formatting is accounted for, the June model does better, not worse, on the LeetCode questions used in the LLM Drift paper: 70% (35/50) vs. the March model's 52% (26/50).

see:

* https://github.com/lchen001/LLMDrift/blob/main/generation/

* https://twitter.com/Si_Boehm/status/1681801371656536068
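
To make "accounting for Markdown formatting" concrete, here's a rough sketch. It assumes the formatting in question is a Markdown code fence wrapped around the model's answer, and it uses a plain syntax check as a stand-in for actually running the LeetCode tests. The helper names (`strip_markdown_fence`, `is_valid_python`, `pass_rate`) are made up for illustration and are not taken from the LLMDrift repo.

```python
import ast
import re

# Built from single backticks so this snippet can itself sit inside a fence.
FENCE = "`" * 3

# Opening fence, optional language tag, then the body up to the closing fence.
FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def strip_markdown_fence(response: str) -> str:
    """Return the body of the first fenced block, or the response unchanged."""
    match = FENCE_RE.search(response)
    return match.group(1) if match else response


def is_valid_python(code: str) -> bool:
    """Syntax check only; a stand-in for running the real test cases."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


# Hypothetical model response: correct code, but wrapped in a Markdown fence.
raw = FENCE + "python\nclass Solution:\n    def add(self, a, b):\n        return a + b\n" + FENCE

print(is_valid_python(raw))                        # fenced text is not valid Python
print(is_valid_python(strip_markdown_fence(raw)))  # the code inside is
print(pass_rate(35, 50), pass_rate(26, 50))        # figures quoted above
```

The point is just that a check which feeds the raw response straight to the interpreter fails on the fence itself, regardless of whether the code inside is correct.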