by rushingcreek
1031 days ago
Right, but there are no contamination studies there. I suspect that RLHF data leaked HumanEval into GPT-4. It just seems unlikely to me that GPT-4's coding abilities have improved since March (when 67% was officially reported by OpenAI), given all of the examples and anecdotes about degradation. This is why we use the official numbers.
When we worked on StarCoder, people ran GPT-4 on MultiPL-E, which doesn't have canonical solutions on the internet, and the performance was higher than what you would expect from the official numbers.
The official contamination analysis shows only a minor drop in performance even though contamination is fairly high (you may argue that contamination is higher now or that RLHF has a stronger effect).
There is a significant drop in performance when testing on HumanEval+ [1], which shouldn't happen if the model has memorized the canonical solutions.
BTW, why don't you use HumanEval+?
[1] https://arxiv.org/abs/2305.01210