by tedsanders
132 days ago
If you don't believe me, that's fair enough. Some pieces of evidence that might update you or others:

- a member of the team who worked with this eval has left OpenAI and now works at a competitor; if we cheated, he would have every incentive to whistleblow
- cheating on evals is fairly easy to catch and risks destroying employee morale, customer trust, and investor appetite; even if you're evil, the cost-benefit doesn't really pencil out to cheat on a niche math eval
- Epoch made a private held-out set (albeit with a different difficulty); OpenAI performance on that set doesn't suggest any cheating/overfitting
- Gemini and Claude have since achieved similar scores, suggesting that scoring ~40% is not evidence of cheating with the private set
- the vast majority of evals are open-source (e.g., SWE-bench Pro Public), and OpenAI along with everyone else has access to their problems and the opportunity to cheat, so FrontierMath isn't even unique in that respect