by EvgeniyZh 1030 days ago
I have several arguments for why contamination is probably not the main reason for the performance difference.

When we worked on StarCoder, people ran GPT-4 on MultiPL-E, which doesn't have canonical solutions on the internet, and the performance was higher than what you would expect from the official numbers.

The official contamination analysis shows only a minor drop in performance even though contamination is fairly high (you may argue that contamination is higher now or that RLHF has a stronger effect).
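To make the "minor drop" comparison concrete, here is a minimal sketch of how such a number can be computed: the pass rate on all problems versus the pass rate on only the problems not flagged as contaminated. The function names and the toy flags below are made up for illustration; they are not the official analysis or real benchmark results.

```python
def pass_rate(results):
    """Fraction of problems solved; `results` is a list of booleans."""
    return sum(results) / len(results) if results else 0.0


def contamination_drop(solved, contaminated):
    """Return (overall, clean_only, drop) pass rates.

    solved[i]:       True if the model solved problem i
    contaminated[i]: True if problem i was flagged as seen in training data
    """
    clean = [s for s, c in zip(solved, contaminated) if not c]
    overall = pass_rate(solved)
    clean_only = pass_rate(clean)
    return overall, clean_only, overall - clean_only


# Toy, made-up flags purely for illustration (not real benchmark data).
solved       = [True, True, False, True, True, False, True, True]
contaminated = [True, False, False, True, False, False, True, False]

overall, clean_only, drop = contamination_drop(solved, contaminated)
print(f"overall={overall:.2f} clean={clean_only:.2f} drop={drop:.2f}")
```

A small drop on the clean subset suggests memorization is not carrying most of the score, though it says nothing about problems the contamination check failed to flag.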

There is a significant drop in performance when testing on HumanEval+ [1], which shouldn't happen if the model has memorized the canonical solutions.
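The reasoning here is that HumanEval+ keeps the same problems but adds many more test cases, so a memorized correct solution should still pass, while a plausible but subtly wrong one gets caught. A rough sketch of that effect, using a hypothetical task and hand-written tests (not taken from either benchmark):

```python
def run_tests(func, tests):
    """Return True if func gives the expected output on every (args, expected) pair."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False
    return True


# Hypothetical task: return the largest element of a list, or None if it is empty.
def plausible_but_wrong(xs):
    best = 0  # bug: wrong for all-negative lists and for the empty list
    for x in xs:
        if x > best:
            best = x
    return best


def reference(xs):
    return max(xs) if xs else None


# A small "base" suite and an "extended" suite that adds edge cases.
base_tests = [(([1, 5, 3],), 5), (([2, 2],), 2)]
extra_tests = base_tests + [(([-4, -1, -9],), -1), (([],), None)]

for name, fn in [("plausible_but_wrong", plausible_but_wrong), ("reference", reference)]:
    print(name, run_tests(fn, base_tests), run_tests(fn, extra_tests))
```

The buggy function passes the base suite and fails the extended one, while the reference passes both. A score that falls under the stricter tests therefore points to solutions that were generated rather than recalled.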

BTW, why don't you use HumanEval+?

[1] https://arxiv.org/abs/2305.01210

1 comment

The "intelligence" of large language models needs to be evaluated like the abilities of self-proclaimed psychics. You send your binary to an independent third party, who evaluates it on new problems. It's only a "Human eval" once.