| I wonder whether we could combine formal proof checkers (like the Lean proof checker) with language models that generate synthetic conjecture-proof pairs in a formal language like Lean. The Lean proof checker could be used to automatically verify whether the synthetic proofs written by the language model are correct. This information could be used to provide an RL reward signal applied to the original language model, which would result in it writing better proofs. (Or we train a new model using the correct synthetic proofs of the previous round as training data.) And then the process repeats. So the model would self-train using its synthetic training data, without further human intervention. We could even make this process more adversarial. First we split the generator language model into two: One which generates conjectures, and one which tries to prove/disprove them in Lean. Then add a predictor model which tries to predict whether a synthetic proof is verified by the Lean proof checker. The lower the predicted probability that the proof will be correct, the more reward gets the proof-generator model if it did indeed provide a correct proof. Finally, we add another model which tries to predict the reward the proof-generator model will get for a given synthetic conjecture. Then the conjecture-generator model is rewarded for conjectures that are predicted to yield a high reward in the proof-generator model. So conjectures that are neither too hard not too easy for the proof-generator model. So we would expect that the whole system would progressively create harder and harder synthetic proofs, which in turn allows for better and better self-training of the proof-generator. It seems this could in principle scale to superhuman ability in generating proofs. The process would be somewhat similar GANs or to self-play in AlphaGo Zero. I think the hard part is the initial bootstrapping part, to get the whole process off the ground. Because the initial training of the generator models has to be done with human provided training data (Lean proofs). But once the synthetic proofs are good enough, the system would self-train itself automatically. |
https://leanprover.zulipchat.com/#streams/219941/Machine%20L...