FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI (epochai.org)
185 points by sshroot 584 days ago
6 comments

For some context on why this is important: this benchmark was designed to be extremely challenging for LLMs, with problems requiring several hours or days of work by expert mathematicians. Currently, LLMs solve 2% of problems in the set (which is kept private to prevent contamination).

They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO question writers):

> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”

Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...

If I was going to bet, I would bet yes, they will reach above 85% performance.

The problem with all benchmarks, one that we just don't know how to solve, is leakage. Systematically, LLMs are much better at benchmarks created before they were trained than after. There are countless papers that show significant leakage between training and test sets for models.

This is in part why so many LLMs are so strong according to benchmarks, particularly older popular benchmarks, but then prove to be so weak in practice when you try them out.

In addition to leakage, people also over-tune their LLMs to specific datasets. They also go out and collect more data that looks like the dataset they want to perform well on.

There's a lot of behind the scenes talk about unethical teams that collect data which doesn't technically overlap test sets, but is extremely close. You can detect this if you look at the pattern of errors these models make. But no one wants to go out and accuse specific teams, at least not for now.

Could you run the benchmark by bootstrapping (average of repeated subsampling), instead of a straight-across performance score, and regain some leakage resistance that way? As well as a better simulation of "out of sample" data, at least for a little while.
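A minimal sketch of what that scoring could look like, assuming a hypothetical list of per-problem pass/fail results. `bootstrap_score`, `results`, and the 50% subsample size are illustrative names and choices, not anything Epoch has described. It reports the average over repeated subsamples plus the spread between them:

```python
import random
import statistics

def bootstrap_score(results, n_resamples=1000, sample_frac=0.5, seed=0):
    """Average accuracy over repeated random subsamples of the problem set.

    results: list of 0/1 outcomes, one per benchmark problem (hypothetical data).
    Returns (mean accuracy, standard deviation across subsamples).
    """
    rng = random.Random(seed)
    k = max(1, int(len(results) * sample_frac))
    scores = []
    for _ in range(n_resamples):
        # Draw k problems with replacement, as in a standard bootstrap.
        subsample = rng.choices(results, k=k)
        scores.append(sum(subsample) / k)
    return statistics.mean(scores), statistics.stdev(scores)

# Illustrative input only: 2 solved out of 100 problems.
results = [1] * 2 + [0] * 98
mean_acc, spread = bootstrap_score(results)
```

On its own this only puts an error bar on the score. Any leakage resistance would have to come from which problems each run is allowed to see.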
This benchmark’s questions and answers will be kept fully private, and the benchmark will only be run by Epoch. Short of the companies fishing out the questions from API logs (which seems quite unlikely), this shouldn’t be a problem.
> answers will be kept fully private

> Short of the companies fishing out the questions from API logs (which seems quite unlikely)

They all pretty clearly state[1] versions of "We use your queries (removing personal data) to improve the models" so I'm not sure why that's unlikely.

[1]: https://help.openai.com/en/articles/5722486-how-your-data-is...

Ideally they would have batches of those exercises, where they only use the next batch once someone has solved a suspicious number of those exercises. If the model performs much worse on the next batch, that is a tell of leakage.
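One way to make "much worse on the next batch" concrete is a two-proportion z-score on the accuracy drop. This is only a sketch: the counts, the function name `leakage_signal`, and the threshold of 3 are all made up for illustration, and it assumes the two batches are of comparable difficulty.

```python
import math

def leakage_signal(old_correct, old_total, new_correct, new_total):
    """Two-proportion z-score for the accuracy drop from an exposed batch to a fresh one.

    A large positive value means the model does suspiciously better on the
    batch it may have seen. All inputs are hypothetical counts.
    """
    p_old = old_correct / old_total
    p_new = new_correct / new_total
    pooled = (old_correct + new_correct) / (old_total + new_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / old_total + 1 / new_total))
    if se == 0:
        return 0.0  # all-pass or all-fail on both batches: no evidence either way
    return (p_old - p_new) / se

# Illustrative numbers: 40/50 on the exposed batch, 10/50 on the fresh one.
z = leakage_signal(old_correct=40, old_total=50, new_correct=10, new_total=50)
suspicious = z > 3.0  # arbitrary illustrative threshold
```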
I looked at the sample questions and even if they get the questions there is no way they will figure out the answers without making significant breakthroughs in understanding mathematics and logic.
>Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

Or they know the ancient technique of training on the test set. I know most of the questions are kept secret, but they are being regularly sent over the API to every LLM provider.

Although the answer isn't sent, so it would have to be a very deliberate effort to fish those out of the API chatter and find the right domain expert with 4-10 hours to spend on cracking it

Just letting the AI train on its own wrong output wouldn't help. The benchmark already gives them lots of time for trial and error.

Why do people still insist that this is unlikely? That's like assuming that the company that paid 15M for chat.com does not have some spare change to pay some graduate students/postdocs to solve some math problems. The publicity from solving such a benchmark would definitely raise the valuation, so it would 100% be worth it for them...
Any benchmark which isn't dynamically generated is useless for that very reason.
Simple: I highly doubt they're willing to risk a scandal that would further tarnish their brand. It's still reeling from last year's drama, in addition to a spate of high-profile departures this year. Not to mention a few articles with insider sources that aren't exactly flattering.
I doubt it would be seen as a scandal. They can simply generate training data for these questions just like they do for other problems. The only difference is probably that the pay rate is much higher for this kind of training data than in most other areas.
You’re not thinking about the other side of the equation. If they win (becoming the first to excel at the benchmark), they potentially make billions. If they lose, they’ll be relegated to the dustbin of LLM history. Since there is an existential threat to the brand, there is almost nothing that isn’t worth risking to win. Risking a scandal to avoid irrelevance is an easy asymmetrical bet. Of course they would take the risk.
Parallel construction

Doesn't cause too much scandal lol

Of course lol. How come e.g. o1 scores so high on these reasoning and math and IMO benchmarks and then fails every simple question I ask of it? The answer is training on the test set.
> Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

Why surprisingly?

The time from now until 2028 is twice as long as capable LLMs have existed to date. By "capable" here I mean capable enough to even remotely consider the idea of LLMs solving such tasks in the first place. ChatGPT/GPT-3.5 isn't even 2 years old!

4 years is a lot of time. It's kind of silly to assume LLM capabilities have already topped out.

Sure but it is also reasonable to consider that the pace of progress is not always exponential or even linear at best. Diminishing returns are a thing and we already know that a 405b model is not 5 times better than a 70b model.
Yes, but!

Exponential pace of progress isn't usually just one thing; if you zoom in, any particular thing may plateau, but its impact compounds in enabling growth of successors, variations, and related inventions. Nor is it a smooth curve, if you look closely. I feel statements like "a 405b model is not 5 times better than a 70b model" are zooming in on a specific class of models so much you can see the pixels of the pixel grid. There's plenty of open and promising research in tweaking the current architecture in training or inference (see e.g. another thread from yesterday[0]), on top of changes to architecture, methodology, methods of controlling or running inference on existing models by lobotomizing them or grafting networks to networks, etc. The field is burning hot right now; we're counting the time between incremental improvements and interesting research directions in weeks. The overall exponent of "language models" power may just well continue when you zoom out a little bit further.

--

[0] - https://news.ycombinator.com/item?id=42093112

How do you determine the multiplier? For example, there are many problems that GPT4 can solve while GPT3.5 can't. In that case it is infinitely better.
Let's say your benchmark gets you at 60% with a 70b parameter model and you get to 65% with a 405b one, it's fairly obvious that it's just incremental progress, not a sustainable growth of capabilities per added parameter. Also, most of the data used these days for training these very large models is synthetic data, which is probably very low quality overall compared to human-sourced data.
But so if there's a benchmark that a model scores at 60%, does it mean that it's literally impossible to make anything that could be more than 67% better?

E.g. if someone scores 60% at a high school exam, is it impossible for anyone to be more than 67% smarter than this person at that subject?

Then what if you have another benchmark where GPT3.5 scores 0%, but GPT4 scores 2%. Does it make GPT4 infinitely better?

E.g. supposedly there was one LLM that did 2% in FrontierMath.

I think because if you end up having an AI that is as capable as the graduate students Tao is used to dealing with (so basically potential Fields medalists), then you are basically betting on an 85% chance that something like AGI (at least in consequence) will be here in 3 years. It is possible, but 85% chance?
It would also require the ability to easily handle large amounts of complex information and dependencies, such as massive codebases, and then also to operate physically like humans do, by controlling a robot of some sort.

Being able to solve self-contained exercises can obviously be very challenging, but there are other types of skills that might or might not be related and have to be solved as well.

>then you are basically betting that 85% chance something like AGI

Not really. It would just need to do more steps in a sequence that current models do. And that number has been going up consistently. So it would be just another narrow AI expert system. It is very likely that it will be solved, but it is very unlikely that it will be generally capable in the sense most researchers understand AGI today.

I am willing to bet it won't be solved by 2028 and the betting market is overestimating AI capabilities and progress on abstract reasoning. No current AI on the market can consistently synthesize code according to a logical specification and that is almost certainly a requirement for solving this benchmark.
What research are you basing this on? Because in particular fill in the middle and other non-standard approaches to code generation have shown incredible capability. I'm pretty sure by 2028 LLMs will be able to write code to specification better than most human programmers. Maybe not on the level of million line monolithic codebases that certain engineers worked on for decades, but smaller, modern projects for sure.
People really love pointing at the first part of a logistic curve and going "behold! an exponential".
Do they? My impression's been the opposite in recent years: "S-curve" is a meme at this point, and is used as a middlebrow dismissal.

"All exponents in nature are s-curves" isn't really useful unless you can point at the limiting factors more precisely than "total energy in observable universe" or something. And you definitely need more than "did you know that exponents are really s-curves?" to even assume we're anywhere close to the inflection point.

I think (to give them the most generous read) they are just betting the halfway is still pretty far ahead. It is a different bet but IMO not an inherently ridiculous one like just misidentifying the shape of the thing; everything is a logistic curve, right? At least, everything that doesn’t blow up to infinity.
Except LLM capabilities have already peaked. Scaling has rapidly diminishing returns.
I have yet to see any published evidence of that.
Since you go that route, do you have published evidence that shows they HAVEN'T entered the top of the S-curve?
For one, thinking LLMs have plateaued is essentially assuming that video can't teach AI anything. It's like saying a person locked into a room his whole life with only books to read would be as good at reasoning as someone who's been out in the world.
What reason do you have to believe we're anywhere close to the middle of the S-curve? An S-curve may be the only sustainable shape in nature in the limit, but that doesn't mean any exponential someone points at is already past the inflection point.
Why are you thinking in binary? It is not clear at all to me that the progress is stagnating, and in fact I am still impressed by the progress. But I couldn't tell whether a wall is coming or not. There is no clear reason why there should be some sort of standard or historical curve for this progress.
This is not a linear process. Deep-learning models do not scale that way.
What kind of evidence could convince you?
Market size matters. There's a whopping total of 71 bidders on that.
Would be interesting to know which model solved the 2% and what is the nature of the problems it solved.
These benchmarks are entirely pointless.

The people making them are specialists attempting to apply their skills to areas unrelated to LLM performance, a bit like a sprinter making a training regimen for a fighter jet.

What matters is the data structures that underlie the problem space - graph traversal. First, finding a path between two nodes; second, identifying the most efficient path; and third, deriving implicit nodes and edges based on a set of rules.

Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph. Until they can consistently manage a number of steps greater than what is contained in any math proof in the validation data, they aren’t genuinely solving these problems; they’re merely regurgitating memorized information.
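For reference, the first two tasks (find a path, find the shortest one by edge count) are easy to check mechanically. The commenter did not share a test harness, so this is only a sketch of a reference solver one could grade model answers against. `shortest_path` and the example edge list are hypothetical, and the third task (deriving implicit nodes and edges from rules) is not covered.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected edge list; returns a shortest path or None."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# A five-edge journey, one step past the "four edges" mentioned above.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("A", "G")]
path = shortest_path(edges, "A", "F")
```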

> Currently, all LLMs are so limited that they struggle with journeys longer than four edges, even when given a full itinerary of all edges in the graph.

This is probably not the case for LLMs in the o1 series and possibly Claude 3.5 Sonnet. Have you tested them on this claim?

Yes, they also fail. I've found the original gpt4 to be the most consistent. One of these days I'll spend the couple of thousand needed to benchmark all the top models and see how they actually perform on a task which can't be gamed.
What kinds of problems in what domains did you test o1 models with?

I found that they are good at logic and math problems but still hallucinate. I didn’t try to stretch test them with hard problems though.

Finding a path between two vertices when given an itinerary of all the edges in a general graph, exactly what I said in the OP.
Not to mention that math proofs are more than graph traversals... (Although maybe simple math problems are not.) There is the problem of extracting the semantics of math formalisms. This is easier in day-to-day language; I don't know to what extent LLMs can also extract the semantics and relations of different mathematical abstractions.
It will be a useful benchmark to validate claims by people like Sam Altman about having achieved AGI.
Most humans can't solve these problems, so it's certainly possible to imagine a legitimate AGI that can't either.
But humans can solve these problems given enough time and domain knowledge. An LLM would never be able to solve them unless they get smarter. That's the point.

It’s not about whether a random human can solve them. It’s whether AI, in general, can. Humans, in general, have proven to be able to solve them already.

I'm responding to this:

> It will be a useful benchmark to validate claims by people like Sam Altman about having achieved AGI.

I think it is possible to achieve AGI without creating an AGI that is an expert mathematician, and that it is possible to create a system that can do FrontierMath without achieving AGI. I.e. I think failure or success at FrontierMath is orthogonal to achieving AGI (though success at it may be a step on the way). Some humans can do it, and some AGIs could do it, but people and AI systems can have human-level intelligence without being able to do it. OTOH I think it would be hard to claim you have ASI if it can't do FrontierMath.

It is very much an open question just what an llm can solve when allowed to generate an indefinite number of intermediate tokens and allowed to sample an arbitrary amount of text to ground itself.

There are currently no tools that let llms do this and no one is building the tools for answering open ended questions.

That's correct. Thanks for clarifying for me because I have gotten tired with the comparison to "99% of humans can't do this" as a counter-argument to AI hype criticism.
AGI should be able to do anything the best humans can do. ASI is when it does everything better than the best humans.
Those thresholds look the same to me, personally.

An AI that can be onboarded to a random white collar job, and be interchangeably integrated into organisations, surely is AGI for all practical purposes, without eliminating the value of 100% of human experts.

If an AI achieved 100% in this benchmark it would indicate super-intelligence in the field of mathematics. But depending on what else it could do it may fall short on general intelligence across all domains.
> they’re merely regurgitating memorized information

Source?

If a model can't innately reason over 5 steps in a simple task but produces a flawless 500-step proof, you either have divine intervention or memorisation.
AlphaGeometry has entered the chat.

Also, AIMOv2 is doing stage 2 of their math challenge, they are now at "national olympics" level of difficulty. They have a new set of questions. Last year's winner (27/50 points) got 2/50 on the new set. In the first 3 weeks of the competition the top score is 10/50 on the new set, mostly with Qwen2.5-math. Given that this is a purposefully made new set of problems, and according to the organizers "made to be AI hard", I'd say the regurgitation stuff is getting pretty stale.

Also also, the fact that claude3.5 can start coding in an invented language w/ ~20-30k tokens of "documentation" about the invented language is also some kind of proof that the stochastic parrots are the dismissers in this case.

I've not tested those models. Feel free to flick me through a couple of k in bitcoins if you'd like me to have a look for you.
I'm not sure if it is feasible to provide all relevant sources to someone who doesn't follow a field. It is quite common knowledge that LLMs in their current form have no ability to recurse directly over a prompt, which inherently limits their reasoning ability.
I am not looking for all sources. And I do follow the field. I just don’t know the sources that would back the claim they are making. Nor do I understand why limits on recursion means there is no reasoning and only memorization.
This is just totally false.

That's exactly what countless techniques related to chain of thought do.

The closest explanation of how chain of thought works is suppressing the probability of a termination token.

People have found that even letting LLMs generate gibberish tokens produces better final outputs. Which isn't a surprise when you realise that the only way an LLM can do computation is by outputting tokens.

It’s sometimes like, are these critics using the tools? It’s a strange schism at the moment.
he just explained it to you.
Regarding keeping the test set private to avoid contamination, the comments about leakage are spot on. The real test set should always be the future.

We should evaluate LLMs on text from beyond their knowledge cutoff date, by computing their per-byte perplexity or per-byte compression ratio. There's a deep theoretical connection between compression and learning.
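A rough sketch of the per-byte metric, assuming you can get per-token log-probabilities out of the model (how to do that depends on the API and is not shown). `bits_per_byte` is a hypothetical helper name and the numbers are made up. Normalizing by UTF-8 bytes instead of tokens keeps models with different tokenizers comparable.

```python
import math

def bits_per_byte(token_logprobs, text):
    """Convert per-token natural-log probabilities into bits per UTF-8 byte.

    token_logprobs: log-probabilities (base e) the model assigned to each
    token of `text`. Lower output means the text was less surprising.
    """
    n_bytes = len(text.encode("utf-8"))
    total_nats = -sum(token_logprobs)
    return total_nats / (math.log(2) * n_bytes)

# Made-up example: 4 tokens, each assigned probability 1/4, covering 8 bytes.
text = "abcdefgh"
logprobs = [math.log(0.25)] * 4
bpb = bits_per_byte(logprobs, text)  # 4 tokens * 2 bits / 8 bytes
```

The same number is the size, in bits per input byte, that an ideal arithmetic coder driven by the model's predictions would approach, which is where the compression view comes from.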

The intuition here is that being able to predict the future of science (or any topic, really) is indicative of true understanding. Slightly more formally: When ICLR 2025 announces and publishes the accepted papers, Yoshua Bengio is less surprised/perplexed by what's new than a fresh PhD student. And Terence Tao is less surprised/perplexed by what will be proven in math in the next 10 years than a graduate student in a related field.

This work has it right: https://ar5iv.labs.arxiv.org/html//2402.00861

Interesting take, sounds like MDL (minimum description length) for LLMs!
ScholarlyArticle: "FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI" (2024) https://arxiv.org/abs/2411.04872 .. https://epochai.org/frontiermath/the-benchmark :

> [Not even 2%]

> Abstract: We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.

Additional AI math benchmarks:

- "TheoremQA: A Theorem-driven [STEM] Question Answering dataset" (2023) https://github.com/TIGER-AI-Lab/TheoremQA

Very cool. It'll be nice to have a benchmark that can be used to validate abstract reasoning capabilities because the hype is really starting to get out of hand.
I wonder if the best benchmark is a Prolog program that generates tests of logical reasoning. You could have a functionally infinite stream of test cases!
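To make the idea concrete, here is a toy version in Python rather than Prolog. It only generates propositional "X implies Y" chains and labels each case by forward chaining, so it is far simpler than what a real Prolog-based generator could produce. `make_case` and `case_stream` are made-up names.

```python
import random

def make_case(rng, n_symbols=6, n_rules=5):
    """Generate one test: random 'X implies Y' rules, one known fact, and a query.

    The ground-truth answer is computed by forward chaining, so every
    generated case comes with a verified label.
    """
    symbols = [chr(ord("A") + i) for i in range(n_symbols)]
    rules = [tuple(rng.sample(symbols, 2)) for _ in range(n_rules)]
    fact, query = rng.sample(symbols, 2)
    known = {fact}
    changed = True
    while changed:  # repeat until no rule adds anything new
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    prompt = "; ".join(f"{p} implies {c}" for p, c in rules)
    prompt += f". {fact} is true. Is {query} necessarily true?"
    return prompt, query in known

def case_stream(seed=0):
    """Endless stream of (prompt, answer) pairs."""
    rng = random.Random(seed)
    while True:
        yield make_case(rng)

stream = case_stream()
prompt, answer = next(stream)
```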
Part of the magic of mathematical reasoning in humans is our ability to sidestep incompleteness theorems or undecidability headaches by simply changing the rules as befits the problem at hand: using a logical tool to solve a math problem seems largely formalizable and testable with Prolog/Lean/etc, but selecting or designing such a tool - e.g. choosing good definitions and axioms - is much more mysterious.

Put a bit more poetically: a Prolog benchmark can adequately test an LLM’s ability to create proofs in Euclidean geometry. But it will never test an LLM’s ability to reason whether a given axiomatization of geometry is actually a reasonable abstraction of physical space. And if our LLMs can do novel Euclidean proofs but are not able to meta-mathematically reason about novel axioms, then they aren’t really using intelligence. Formal logical puzzles are only a small subset of logical reasoning.

Likewise, when Euclidean proofs were a fun pastime among European upper-classes, the real work was being done by mathematicians who built new tools for projective and analytic geometry. In some sense our LLM benchmarks are focusing on the pastime and not the work. But in another sense LLMs are focusing on the tricksy and annoying sides of actually proving things, leaving humans free to think about deeper problems. So I’m not skeptical of LLMs’ utility in mathematical research, but rather the overinflated (and investor-focused) claims that this stuff is a viable path to AGI.

You could but most LLMs can't solve sudoku puzzles even though the training corpus already contains books on logic, constraint propagation, and state space exploration with backtracking.
The LLM just doesn't have enough compute, I think; step by step it could probably do it. Anyway, most leading LLMs can write a backtracking search program to do it, and I think +tool use should be counted.
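For anyone who hasn't seen one, the kind of backtracking program meant here is short. This is a plain textbook sketch, not the output of any particular model, and the puzzle is just a commonly used example grid.

```python
def allowed(grid, r, c, v):
    """True if v does not already appear in row r, column c, or the 3x3 box."""
    if v in grid[r]:
        return False
    if any(grid[i][c] == v for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[br + i][bc + j] != v for i in range(3) for j in range(3))

def solve(grid):
    """Solve a 9x9 sudoku in place by backtracking; 0 marks an empty cell."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for v in range(1, 10):
                    if allowed(grid, r, c, v):
                        grid[r][c] = v
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo and try the next value
                return False  # nothing fits in this cell: backtrack
    return True  # no empty cells left

puzzle = [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    [0, 9, 8, 0, 0, 0, 0, 6, 0],
    [8, 0, 0, 0, 6, 0, 0, 0, 3],
    [4, 0, 0, 8, 0, 3, 0, 0, 1],
    [7, 0, 0, 0, 2, 0, 0, 0, 6],
    [0, 6, 0, 0, 0, 0, 2, 8, 0],
    [0, 0, 0, 4, 1, 9, 0, 0, 5],
    [0, 0, 0, 0, 8, 0, 0, 7, 9],
]
solved = solve(puzzle)
```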
I mean, this benchmark is really hard.

I don't think it's a requirement that a system claiming to be AGI should be able to solve these problems, 99.99% of humans can't either.

An AGI is often claimed to be a general purpose problem solver and these are exactly the types of problems that a general purpose problem solver would be able to solve if given access to a mathematical library. All existing LLMs have been trained on abstract mathematics and logic but it is obvious that they are incapable of abstract logical reasoning, e.g. solving sudoku puzzles.
Here is my prediction, FWIW: the hard part of the problem has already been solved, in the following technical sense: there is a program of a few thousand lines that has not been invented yet, but will be invented soon, that loads a current LLM model, runs fast on current hardware, and you will deem it to be an AGI. In other words, the conditional Kolmogorov complexity of indisputable AGI given the Llama weights is only a few 1000 bytes. We are at the pre-AlphaGo, post Clark-Storkey stage of reasoning. That's my guess, anyway.
I think you are likely right but coming up with that final inference strategy is still "the hard part" IMO. Not in terms of computation, but in terms of algorithm development.
99.9% of the population would not be able to solve these problems given a year and access to every piece of mathematical literature ever written (except the solutions to these problems of course).

Saying that you need to solve these to be considered AGI is ridiculously strict.

How do they solve the 2%? This is the question. If those problems were unseen, that might be already very impressive.
Not very impressed by the problems they displayed, but I guess there should be some good problems in the set given the comments (not in the sense that I find them super easy, but they seem random and not super well-posed, and extremely artificial problems, in the sense that they seem to not be of particular mathematical interest [or at least the mathematical content of the problem is being deliberately hidden for testing purposes] but constructed according to some weird criteria). Would be happy to hear an elaboration on the comments by the well-known mathematicians.
Hmm. I’m a hard disagree. The problems they show have a number of really nice properties for LLM assessment: They require broad, often integrated knowledge of diverse areas of mathematics, the answers reduce to a number, often a very large number, and thus extremely difficult to guess, and they require a significant amount of symbolic parsing and (I would say) reasoning skills. If we think about what makes a quality mathematician, I’d propose it’s the ability to come at a problem both from the top —- conceptually — and from the bottom — applying various tools and transformations — with a sort of direction in mind that gets to a result.

I’d say these problems strongly encourage that sort of behavior.

I’m also someone who thinks building in abilities like this to LLMs would broadly benefit the LLMs and the world, because I think this stuff generalizes. But, even if not, it would be hard to say that an LLM that could test 80% on this benchmark would not be useful to a research mathematician. Terence Tao’s dream is something like this that can hook up to LEAN, leaving research mathematicians as editors, advisors, and occasionally working on the really hard parts while the rest is automated and provably correct. There’s no doubt in my mind that a high scoring LLM for this benchmark would be helpful in that concept.

I guess the primary reason is that the answers must be numbers that can be verified easily. Otherwise, you just flood the validator with long LLM reasoning that's hard to verify. People have been proposing using LEAN as a medium for answers but AFAIK even LEAN is not mainstream in the general math community, so there are always trade-offs.

Also, coming up with good problems is an art in its own right; the Soviets were famous for institutionalizing anti-Semitism via special math puzzles for Jews in Moscow University entrance exams. The questions were constructed so that they are hard to solve but have elementary solutions, which deflected criticism.