| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by bawolff 138 days ago
	Paper not about benchmarking or ML research is bad from the perspective of benchmarking. Not exactly a shocker. The authors themselves literally state: "Unlike other proposed math research benchmarks (see Section 3), our question list should not be considered a benchmark in its current form"

1 comments

data_maan 138 days ago

On the website https://1stproof.org/#about they claim: "This project represents our preliminary efforts to develop an objective and realistic methodology for assessing the capabilities of AI systems to autonomously solve research-level math questions."

Sounds to me to be a benchmark in all but a name. And they failed pretty terribly at achieving what they set out to do.

link

bwfan123 138 days ago

> And they failed pretty terribly at achieving what they set out to do.

Why the angst ? If the ai can autonomously solve these problems, isnt that a huge step forward for the field.

link

data_maan 137 days ago

It's not angst. It's intense frustration that they 1) are not doing the science correctly, and 2) that others (e.g. FrontierMath) already did everything they claim to be doing, so we won't learn anything new here, but somehow 1stproof get all the credit.

link

bawolff 137 days ago

Are they really trying to do science, or are they just trying to determine pragmatically whether or not current AI is useful for a research mathematician in their day to day job?

link

data_maan 135 days ago

If it's the latter case (which it has to be), it seems that attention credit (via, e.g., articles in NY Times) is very unfairly distributed.

None of the people that advanced the state of benchmarking and did the hard work on much bigger benchmarks got any, but a ridiculous benchmark of 10 question scored big.

link

bwfan123 137 days ago

> are not doing the science correctly

What do you mean ? These are top-notch mathematicians who are genuinely trying to see how these tools can help solve cutting edge research problems. Not toy problems like those in AIME/AMC/IMO etc. or other similar benchmarks which are gamed easily.

> that others (e.g. FrontierMath) already did everything they claim to be doing

You are kidding right ? FrontierMath benchmark [1] is produced by a startup whose incentives are dubious to say the least.

[1] https://siliconreckoner.substack.com/p/the-frontier-math-sca...

Unlike the AI hypesters, these are real mathematicians trying to inject some realism and really test the boundaries of these tools. I see this as a welcome and positive development which is a win-win for the ecosystem.

link

data_maan 135 days ago

> What do you mean ? These are top-notch mathematicians

YeS. I didn't dispute that. I disputed that they are NOT top notch ML specialist and have made one of the worst benchmarks of 2025-2026. Benchmarks like these would have worked maybe in early 2024 at latest. The field has moved on significantly since.

And yes, many many other benchmarks don't use toy problems -- their names are just a prompt away.

> You are kidding right ? FrontierMath benchmark [1] is produced by a startup whose incentives are dubious to say the least.

They did 1) open source some of their datapoints (on a similar order of magnitude) and 2) they carried out detailed evals. Here is much to learn from their blog posts, much more than from the current dataset.

But fair. If you don't like them, have a look at IMProofBench. Have a look at the AIMO competition. Have a loom at HardMath. It's quite a landscape of datasets already.

> Unlike the AI hypesters, these are real mathematicians trying to inject some realism and really test the boundaries of these tools

As mentioned above, realistic benchmarks that are bigger and better exist. Unfortunately, from a benchmarking POV, these mathematicians are the hypesters with a preprint that wouldnt even make it to the AI&Math workshops at ICML or NeurIPS.

link