| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by amelius 320 days ago
	Since the proof already exists in human-written form, I'm wondering, can't OpenAI's IOM gold winning algorithm not translate the blueprint to lean?

7 comments

jhanschoo 320 days ago

In addition to other comments, see https://xenaproject.wordpress.com/2024/12/11/fermats-last-th...

In particular, note that a key lemma of crystalline cohomology rests on a mistake. Experts think that it is fixable by virtue that results have depended on it for a long time and no issue was found, but it is not fixed.

link

NooneAtAll3 320 days ago

> and my understanding of Maria Ines’ talk is that these issues have now been sorted out

link

jhanschoo 320 days ago

I think you are trying to say that this matter has since been resolved and so presumably the whole informal proof somehow resides in literature. I suppose that to the first point you may be right (I'm guessing that it's since been made available), but to the second point I think you are overconfident that similar gaps do not exist.

I admit that my original comment was inaccurate, as it seems to suggest that the gap still exists.

link

seanwilson 320 days ago

"The International Mathematical Olympiad (IMO) is the World Championship Mathematics Competition for High School students", so not to undermine it but it's below university or graduate level.

Research level mathematics like this is as hard as it gets, and this proof is famously difficult: uses many branches of advanced mathematics, required thousands of pages of proofs, years of work.

link

amelius 320 days ago

Yes but the hard work (coming up with a human-readable proof) has already been done.

link

seanwilson 320 days ago

Human readable (informal) proofs are full of gaps that all have to be traced back to axioms e.g. gaps that rely on shared intuition, background knowledge and other informal proofs.

It's somewhat like taking rough pseudo code (the informal proof, a mixture of maths and English) and translating that into a bullet-proof production app (the formal proof, in Lean), where you're going to have to specify every step precisely traced back to axioms, handle all the edge causes, fix incorrect assumptions, and fill in the missing parts that were assumed to be straightforward but might not be.

A major part is you also have to formalise all the proofs your informal proof relied on so everything is traced back to the initial axioms e.g. you can't just cite Pythagorus theorem, you have to formalise that too.

So it's an order of magnitude more difficult to write a formal proof compared to an informal one, and even when you have the informal proof it can teams many years of effort.

link

bee_rider 320 days ago

I’m almost certain this is ignorance on my part, but it seems like this would mean the proof is… possibly wrong? I mean if there are gaps and other informal proofs in there?

But I thought it was a widely celebrated result.

link

williamstein 320 days ago

Kevin Buzzard told me that the worry that it might in fact be wrong is a huge motivator for him. I also once asked Serge Lang why so much mathematics is correct (which surprised me coming from programming where everything has bugs), and he said; “people do a large number of consistency checks beyond what is in the published proofs, which makes the chances the claimed results are correct much, much higher.” Another related quote Bryan Birch told me once: “it is always a good idea to prove true theorems.”

link

adastra22 320 days ago

I suspect he is massively overestimating the reliability of obscure mathematics.

link

ykonstant 320 days ago

> Kevin Buzzard told me that the worry that it might in fact be wrong is a huge motivator for him.

And then, when I raised concerns in Zulip about Lean's metaprogramming facilities being used to trick the pipeline into accepting false proofs, he said the opposite. He even emphasized that the formalization efforts are not for checking proof correctness, but for cataloguing truths we believe in.

This kind of equivocation turned me away from that community, to be honest. That was an extremely frustrating experience.

link

seanwilson 320 days ago

> “people do a large number of consistency checks beyond what is in the published proofs, which makes the chances the claimed results are correct much, much higher.”

I imagine one bias is because formal verification is such a huge effort, you're only going to do it for really interesting and impactful proofs, which means the proofs that get formal verified will already have been reviewed and critiqued a lot, and will be less likely to have glaring critical errors.

link

seanwilson 320 days ago

> I’m almost certain this is ignorance on my part, but it seems like this would mean the proof is… possibly wrong?

That's part of the motivation to formalise it. When a proof gets really long and complex, relies on lots of other complex proofs, and there's barely any single expert who has enough knowledge to understand all the branches of maths it covers, there's more chance there's a mistake.

There's a few examples here of errors/gaps/flaws found while formalizing proofs that were patched up:

https://mathoverflow.net/questions/291158/proofs-shown-to-be...

My understanding is it's common to find a few problems that then need to be patched or worked around. It's a little like wanting to know if a huge codebase is bug-free or not: you might find some bugs if you formalized the code, but you can probably fix the bugs during the process because it's generally correct. There can be cases where it's not fixable though.

link

infogulch 320 days ago

I think it was Terrance Tao on the Lex Friedman podcast recently that said that there are very often little mistakes in big proofs, but they are almost always able to be patched around. Its like mathematicians' intuition is tracking some underlying reality and the actual formalization is flexible. Yes sometimes digging down into a small mistake leads to an unbridgeable gap and that route has to be abandoned, but uncannily often such issues have nearby solutions.

link

staunton 320 days ago

Also a lot of errors would be called "typos", not errors. Such as some edge cases missing in the theorem statement which technically makes the theorem false. As long as there's a similar theorem in the same spirit that can be proven, that's what the original was all along.

link

6gvONxR4sf7o 320 days ago

The level of rigor used in math if sometimes characterized as "sufficient to convince other mathematicians of correctness." So, yeah possibly, but not in a willy nilly way. It's not a proof sketch, it's a proof. It just isn't written in human language designed for communication.

link

paulddraper 320 days ago

Yes and when it was first published it was wrong (made leap of logic).

It takes thorough review by advanced mathematicians to verify correctness.

This is not unlike a code review.

Most people vastly underestimate how complex and esoteric modern research mathematics are.

link

adastra22 320 days ago

The thing is though that MANY bugs slip through even the most thorough code reviews. As a security researcher, I can tell you there is literally no system out there that doesn't have such a bug in it.

The systems we deal with in software are massive compared with your typical mathematical framework though. But FLT is probably on similar scope.

link

amelius 320 days ago

> It's somewhat like taking rough pseudo code and translating that into a bullet-proof production app

That's actually where LLMs are already quite good at.

link

oasisaimlessly 320 days ago

link

BoiledCabbage 320 days ago

I have little knowledge in this area, but my understanding is it's like this:

There is a pseudocode app that depends on a bunch of pseudocode libraries. They want to translate that pseudocode app into a real runnable app. They can do that, and it's a good amount of work, but reasonable. The problem is to get the app to run they also need to translate the hundreds or thousands of pseudocode libraries into actual libraries. Everything from OS APIs, networking libs, rendering libs, language standard libs all need ro be converted from specs and pseudocode to real code to actually run the app. And that's a ton of work.

link

rcxdude 320 days ago

No, some of the harder work has been done. Translating human-readable proofs into machine-readable ones is also very hard work and an area of active research.

link

rcxdude 320 days ago

This is a significantly harder problem than winning gold in IOM. A large part of it is figuring out how to represent some of the relevant ideas in Lean at all.

link

Davidzheng 320 days ago

I disagree. I think it's a sequence of a huge number of modular moderately hard tasks each much easier than a hard IMO question.

link

markusde 319 days ago

IMO problems are stated in and solved by math a high schooler could understand. Getting the definitions right (one of the harder parts of mechanizing proofs IME) is a different beast altogether.

link

ethan_smith 320 days ago

Formalizing Wiles' proof requires translating hundreds of pages of sophisticated mathematics with implicit reasoning steps into a precise logical framework, which is fundamentally different from the pattern-matching AI uses to solve competition problems.

link

adastra22 320 days ago

That's not how state of the art models work.

link

yorwba 320 days ago

All three claims of gold medal performance on IMO 2025 that I'm aware of solved the first 5 problems, that were designed to be solvable by application of standard techniques, but got stumped on the sixth problem that was a bit more unusual. So it does seem like state-of-the-art models solve competition problems by recognizing which kind of problem it is and applying a corresponding solution template. Which is not too different from human competitors exploiting common question patterns, but humans seem to be able to degrade more gracefully by falling back to a more explorative mode when none of the standard tricks seem to apply.

link

kmill 320 days ago

My understanding is that the proof doesn't exist in written form in its entirety.

Plus, Kevin Buzzard is a world expert with some ideas for how to better organize the proof. In general, formalization leads to new understanding about mathematics.

Something people outside of mathematics don't tend to appreciate is that mathematicians are usually thinking deeply about what we already know, and that work reveals new structures and connections that clarify existing knowledge. The new understanding reveals new gaps in understanding, which are filled in, and the process continues. It's not just about collecting verifiably true things.

Even if somehow the OpenAI algorithm could apply here, we'd get less value out of this whole formalization exercise than to have researchers methodically go through our best understanding of our best proof of FLT again.

link

tripplyons 320 days ago

You're either underestimating the length of the proof, or overestimating the length of tasks that models can currently accomplish.

link

amelius 320 days ago

The blueprint is a step-by-step outline.

link

tripplyons 320 days ago

If the goal is to formalize the proof, you would need more than an outline.

link

voxl 320 days ago

Proof assistant code is high reliability, there is no room to fudge it. This is perhaps the one place where you can really see how bad LLMs are when you care about reliability.

link

UltraSane 320 days ago

Actually automated theorem provers like Lean are the PERFECT use for LLMs because you can instantly determine if the proof it generated is correct.

link

markusde 319 days ago

Proofs, sure, but not definitions. A human needs to be sure that the definitions align with that they expect. Unfortunately, humans generating correct definitions and LLM's generating correct proofs are not independent problems.

link

wizzwizz4 320 days ago

Actually, some proofs take longer for the computer to verify than for a human (even an unskilled typist) to type out. Several hours to evaluate a two-page document isn't unusual. (I prefer to write optimised proofs, but it can take me weeks to find the correct abstractions for that.)

link

UltraSane 320 days ago

Are you talking about Lean 4? Lean 4 is usually pretty fast at verifying proofs and most proofs are created in interactive mode so are checked as they are typed.

A "two-page document" (perhaps 200-300 lines of code) would typically check in a matter of seconds to, at most, a few minutes. If it took hours, it would mean the file contains a pathological case that a developer would be expected to identify and fix. It is absolutely not a normal or "not unusual" occurrence.

link

wizzwizz4 319 days ago

Steps to reproduce:

1. Load a complex data structure, e.g. from JSON. (I do not consider typing out the JSON to be part of the human's job.)

2. Process the data, using the magic of functional programming. (Performance? What's that? Performance is not a priority.)

3. Add some helpful lemmas.

4. Prove the data processing stage was correct… but proving this in general is haaaard, so just make automation that's capable of proving it for most cases… eventually.

5. Great! It takes 3 minutes to run on my tiny examples. Now run it on the real data.

And that's where hours come from!

link

UltraSane 318 days ago

This is NOT what Lean 4 is designed to do. You seem to be using Lean 4 for what TLA+ is much better suited for.

link

griffzhowl 320 days ago

The problem is if they don't produce any correct ones then that doesn't matter

link

adastra22 320 days ago

Why? Coding assistants tend to do even better in contexts where they have tools like type checkers and linters to verify their implementations. This area seems actually uniquely well suited to LLM usage.

link

rocqua 320 days ago

When I asked experts on formal proofs a year ago, their intuition was that there isn't enough formal proofs out there for LLMs to be very good at the syntax.

It's, as far as I know, quite hard to teach an LLM things it doesn't know.

link

UltraSane 320 days ago

He is right and it doesn't matter because you can instantly tell if the proof the LLM generates is true or not.

link

kmill 320 days ago

I see people on Zulip using Copilot to write Lean proofs, and they have some success, but the quality is really bad right now, creating long, unmaintainable proofs. New users get stuck, thinking they're 90% of the way to the end, but really the whole thing probably should be scrapped and they should start over.

It's a bit frustrating because, before Copilot, new users would come with proofs and you could spend some time helping them write better proofs and they'd learn things and gain skills, but now it's not clear that this is time well spent on my part. Copilot is not going to learn from my feedback.

link