by EvgeniyZh
1030 days ago
I have several arguments for why contamination is probably not the main reason for the performance difference.

When we worked on StarCoder, people ran GPT-4 on MultiPL-E, which doesn't have canonical solutions on the internet, and the performance was higher than what you would expect from the official numbers.

The official contamination analysis shows only a minor drop in performance even though contamination is fairly high (you may argue that contamination is higher now or that RLHF has a stronger effect).

There is a significant drop in performance when testing on HumanEval+ [1], which shouldn't happen if the model has the canonical solutions memorized.

BTW, why don't you use HumanEval+?

[1] https://arxiv.org/abs/2305.01210