by Davidzheng 591 days ago
Not very impressed by the problems they displayed, but I guess there should be some good problems in the set given the comments (not in the sense that I find them super easy, but they seem random, not super well-posed, and extremely artificial, in the sense that they seem not to be of particular mathematical interest [or at least the mathematical content of the problem is being deliberately hidden for testing purposes] but constructed according to some weird criteria). Would be happy to hear an elaboration on the comments by the well-known mathematicians.
2 comments

Hmm. I’m a hard disagree. The problems they show have a number of really nice properties for LLM assessment: they require broad, often integrated knowledge of diverse areas of mathematics; the answers reduce to a number, often a very large one, which makes them extremely difficult to guess; and they require a significant amount of symbolic parsing and (I would say) reasoning skills. If we think about what makes a quality mathematician, I’d propose it’s the ability to come at a problem both from the top (conceptually) and from the bottom (applying various tools and transformations), with a sort of direction in mind that gets to a result.

I’d say these problems strongly encourage that sort of behavior.

I’m also someone who thinks building abilities like this into LLMs would broadly benefit the LLMs and the world, because I think this stuff generalizes. But even if not, it would be hard to say that an LLM that could score 80% on this benchmark would not be useful to a research mathematician. Terence Tao’s dream is something like this that can hook up to LEAN, leaving research mathematicians as editors and advisors who occasionally work on the really hard parts while the rest is automated and provably correct. There’s no doubt in my mind that a high-scoring LLM for this benchmark would be helpful in that concept.

I guess the primary reason is that the answers must be numbers that can be verified easily. Otherwise, you just flood the validator with long LLM reasoning that's hard to verify. People have been proposing using LEAN as a medium for answers, but AFAIK even LEAN is not mainstream in the general math community, so there are always trade-offs.
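
To make the "verified easily" point concrete, here is a minimal sketch of what exact-match grading of a numeric answer could look like. This is not the benchmark's actual grader; the function name `check_answer` and the choice to tolerate thousands separators are assumptions made purely for illustration.

```python
def check_answer(submitted: str, expected: int) -> bool:
    """Return True only if the submitted text is exactly the expected integer.

    Hypothetical helper, not taken from any real benchmark harness.
    """
    try:
        # Python integers have arbitrary precision, so very large answers
        # are compared exactly, with no floating-point rounding.
        value = int(submitted.strip().replace(",", ""))
    except ValueError:
        # Anything that is not a bare integer (prose, partial reasoning) fails.
        return False
    return value == expected


print(check_answer(" 1,234,567 ", 1234567))
print(check_answer("roughly 1234567", 1234567))
```

The check is a single equality test, so the grader never has to read the model's reasoning. That is the trade-off described above: cheap verification, at the cost of only being able to ask questions whose answer is a number.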

Also, coming up with good problems is an art in its own right; the Soviets were famous for institutionalizing anti-Semitism via special math puzzles for Jews in Moscow University entrance exams. The questions were constructed so that they were hard to solve but had elementary solutions, which helped divert criticism.