| HN Mirror


	by root-parent 10 days ago
	"...Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers... We present the resulting collection of 100 questions....We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs....we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive..."

1 comments

rabidvermin 10 days ago

mathematics questions with known answers...

... that are therefore liable to be in the training data?

fc417fc802 10 days ago

I had the same thought, because even if the exact solution doesn't appear there's a notable difference between performing a literature search versus solving something de novo. But I think perhaps this benchmark wasn't meant to exclude the former and that the point may have been to test the ability of the model to accurately interpret and synthesize relevant output for research level mathematical problems at all.

christianstump 10 days ago

I think you are underestimating the complexity of such problems. A PhD in the exact field of research would need days to weeks to understand what the problem means and how to solve it. This is far beyond "throwing standard techniques" at a problem. (But, I keep emphasizing this, it is also far away from solving research mathematics.)

fc417fc802 10 days ago

What did I say that led you to believe I was underestimating the complexity? I don't believe I commented on it at all.

christianstump 10 days ago

When you write "there's a notable difference between performing a literature search versus solving something de novo", you suggest that the questions we provided can be solved by doing a literature search.

This is incorrect. What is correct is the following: When understanding the existing literature on a question in the dataset, one can derive the answer without creating new mathematics research.

So it was the difference between "searching the literature" and "understanding the literature" that made me believe it. But if you didn't mean that, even better!

fc417fc802 10 days ago

I did not suggest that, no. I stress that claiming a possibility is not the same as claiming a fact.

I observed that the two things are quite different in terms of model capabilities. That's relevant when considering how to interpret the results of the benchmark. We need to differentiate between (at minimum) reproducing an (approximately) verbatim answer from the training set, assembling disparate items from the training set into an answer piecewise, and performing novel logical inference using items from the training set.

I further speculated about the intent of the authors but you seem to be saying that my guess was wrong. In response I will observe that for any problem that's known to be solved it's likely to be quite difficult if not impossible to confidently determine that the model performed a de novo derivation as opposed to finding pieces of the answer in various places.

Of course there's absolutely nothing wrong with the latter! It's just important to be aware of the possibility when drawing conclusions about model capabilities.

tossandthrow 10 days ago

I can recommend reading section 2 of the paper.

The goal was not to define unsolved problems.

But as such, the problems are also not previously published problems.

This seems quite reasonable IMHO.

criemen 10 days ago

Partially, 2.2 Submission workflow W2 deals with this:

> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.

So "trivially contained in the training data" is excluded, as then all models could/should easily come up with the solution.
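
For concreteness, the W2 rule quoted above can be sketched as a simple filter. This is only an illustration of that one quoted sentence, not code from the paper: the function name `passes_stage_w2` and the list-of-booleans representation of the LLM judge's verdicts are my own assumptions.

```python
from typing import Sequence

# Threshold taken from the quoted text: "If at most three models answered correctly".
MAX_CORRECT = 3


def passes_stage_w2(judged_correct: Sequence[bool], max_correct: int = MAX_CORRECT) -> bool:
    """Return True if the contributor could proceed past Stage W2.

    judged_correct holds one verdict per model (hypothetical representation):
    True means the LLM judge marked that model's answer as matching the
    contributor's original answer.
    """
    # Count how many models the judge marked correct and compare to the cap.
    return sum(bool(v) for v in judged_correct) <= max_correct


# Five models, three judged correct: the question is hard enough to proceed.
print(passes_stage_w2([True, True, True, False, False]))
# Five models, four judged correct: the question is filtered out.
print(passes_stage_w2([True, True, True, True, False]))
```

A question that every model gets right at the first attempt never makes it past this check, which is the sense in which the trivially memorized cases are screened out.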

andy99 10 days ago

“In the training data” isn’t really relevant for a modern LLM. The better question is whether the problems are solvable using known techniques that have been fine-tuned in.

A simple example, as a non-mathematician: I’d expect a well-trained LLM to be able to solve any integral that can be solved with integration by parts. I would be much more interested to see it solve one with no known solution using some novel technique.

Obviously this doesn’t really lend itself to making a benchmark, but if something is solvable by a known technique, and the LLM has had some kind of RL training on using that technique, seeing a solution isn’t too surprising.
