For some context on why this is important: this benchmark was designed to be extremely challenging for LLMs, with problems requiring several hours or days of work by expert mathematicians. Currently, LLMs solve 2% of the problems in the set (which is kept private to prevent contamination). They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO question writers):

> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”

Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...
The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs do much better on benchmarks created before they were trained than on ones created after. There are countless papers that show significant leakage between training and test sets for models.
This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.
In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.
There's a lot of behind-the-scenes talk about unethical teams that collect data that doesn't technically overlap with test sets but is extremely close to them. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.