


	by isotypic 583 days ago
	> and finally find a path to the solution. But how does the student, or in your case the LLM, know that it actually has the solution? For students, this is done by: a grader grading the homework, asking the professor at office hours, working on problems with peers who cross-check as you go. I see no reason why this LLM-produced synthetic data, without this correction factor, would not devolve into a mess of incorrect, maybe even not-even-wrong style "proofs". And then how can training on this yield anything?


cevi 583 days ago

When you get good enough at mathematics, you can tell if your proofs are correct or not without asking a TA to grade them for you. Most mathematicians reach this level before they finish undergrad (a rare few reach it before they finish high school). While AI hasn't reached this level yet, there is no fundamental barrier stopping it from happening - and for now, researchers can use formal proof-checking software like Lean, Coq, or Isabelle to act as a grader.

(In principle, it should also be possible to get good enough at philosophy to avoid devolving into a mess of incoherence while reasoning about concepts like "knowledge", "consciousness", and "morality". I suspect some humans have achieved that, but it seems rather difficult to tell...)


isotypic 581 days ago

> When you get good enough at mathematics, you can tell if your proofs are correct or not without asking a TA to grade them for you.

This is simply not true - you can get a very good sense of when your argument is correct, yes. But having graded for courses (graduate ones, even!), I can say that even advanced students make mistakes. It's not limited to students, either; tons of textbooks have significant errata, and it's not as if no retraction in math has ever been issued.

These get corrected by talking with other people - if you have an LLM spew out this synthetic chain-of-reasoning data, you probably get at least some wrong proofs, and if you blindly try to scale with this I would expect it to collapse.

Even tying into a proof-checker seems non-trivial to me. If you work purely in the proof-checker, you never say anything wrong - but the presentation in a proof-checking language is very different from a textual one, so I would anticipate issues with the LLM leveraging knowledge from, say, textbooks in its proofs. You might also run into issues of the AI playing a game against the compiler rather than building understanding (you see elements of this in the proofs produced by AlphaProof). And if you start mixing natural language and proof checkers, you've just kicked the verification can down the road a bit, since you need some way of ensuring the natural language actually matches the statements being shown by the proof checker.

I don't think these are insurmountable challenges, but I also don't think it's as simple as the "generate synthetic data and scale harder" approach the parent comment suggests. Perhaps I'm wrong - time will tell.


cevi 581 days ago

The error rate of human mathematical work is not zero, but it does go down exponentially with the amount of time that the mathematician spends carefully thinking about the problem. Mistakes tend to be the result of typos, time pressure, or laziness - showing your work to others and having them check it over does help (it's one of the reasons we have peer review!), but is absolutely not necessary.

If the error rate is low enough - and by simply spending a constant factor more time finding and correcting errors, you can get it below one in a million - then you do get a virtuous feedback loop even without tying in a proof-checker. That's how humans have progressed, after all. While you are right to say that the proof-checker approach certainly is not trivial, it is currently much easier than you would expect - modern LLMs are surprisingly good at converting math written in English directly to math formalized in Lean.
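To make "math written in English" versus "math formalized in Lean" concrete, here is a small hand-written sketch. It assumes Lean 4 syntax and only the core library; the theorem name `even_add_even` is made up for this illustration and is not taken from any library or from LLM output. The English statement is: "the sum of two even natural numbers is even."

```lean
-- English: if a = 2k and b = 2m for some naturals k and m,
-- then a + b = 2n for some natural n (namely n = k + m).
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n :=
  match ha, hb with
  | ⟨k, hk⟩, ⟨m, hm⟩ =>
    -- Substitute a and b, then distribute 2 over (k + m).
    ⟨k + m, by rw [hk, hm, Nat.mul_add]⟩
```

If the witness or the rewriting step were wrong, the checker would reject the file, which is the sense in which it can stand in for a grader.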

I do think that LLMs will struggle to learn to catch their mistakes for a while. This is mostly because the art of catching mistakes on your own is not taught well (often it is not taught at all), and the data sets that modern LLMs train on probably contain very, very few examples of people applying this art.

A tangent: how do human mathematicians reliably manage to catch mistakes in proofs? Going line-by-line through a proof and checking that each line logically follows from the previous lines is what many people believe we do, but it is actually a method of last resort - something we only do if we are completely lost and have given up on concretely understanding what is going on. What we really do is build up a collection of concrete examples and counterexamples within a given subject, and watch how the steps of the proof play out in each of these test cases. This is why humans tend to become much less reliable at catching mistakes when they leave their field of expertise - they haven't yet built up the necessary library of examples to allow them to properly interact with the proofs, and must resort to reading line by line.
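A rough mechanical analogue of this habit, as a minimal sketch: run a suspect proof step against a small library of concrete cases and see whether any of them breaks it. The function names and the chosen example step are hypothetical and only for illustration; real mathematical test cases are of course much richer than small integer triples.

```python
from itertools import product


def first_counterexample(claim, cases):
    """Return the first case on which `claim` fails, or None if none fails."""
    for case in cases:
        if not claim(*case):
            return case
    return None


def suspect_step(a, b, c):
    """Suspect step: 'if a divides b*c, then a divides b or a divides c'."""
    if (b * c) % a != 0:
        return True  # hypothesis does not hold, so the step is vacuously fine
    return b % a == 0 or c % a == 0


# A small "library of examples": all triples (a, b, c) with entries in 1..6.
all_cases = list(product(range(1, 7), repeat=3))

# The same library restricted to prime a, where the step is Euclid's lemma.
prime_cases = [case for case in all_cases if case[0] in (2, 3, 5)]

if __name__ == "__main__":
    print(first_counterexample(suspect_step, all_cases))
    print(first_counterexample(suspect_step, prime_cases))
```

On the full library the step breaks at a = 4, b = 2, c = 2 (4 divides 4 but not 2), while on the prime-only library it survives every case, which points at the missing hypothesis that a must be prime.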
