| > How does the MCTS distinguish between 'generated a stupid lemma that is true' and 'generated a valid lemma'? It does not, though I would like to add validation as part of the check as future work. > Is there any reason to expect that a 'good' partial subtree will result in a 'good' output? Yes. I am using Phind Code Llama and it generates a lot of invalid Dafny out of the box. Now, all these false starts get pruned. I do suspect that if we try to scale this to Coq proof, we will run in local optima rabbit holes. > Why do you think this approach will generate valid lemmas?
> (Not structurally valid; valid as in, they assert a solution to the given problem). Because the LLM tries to follow instructions. > "Generate me a function for X and a bunch of tests that verify X".
> "If the tests pass, the function is good"
> ...but, there's no specific reason to expect that to be true? Indeed, there are complementary approaches like Speculyzer that help validate specs and code at once. > How is what you're doing different? That's not the problem I am solving at the moment. What I solve is that I am able to generate verified code (that still need to be inspected) while the LLM struggles to do in one go. > I guess I'm struggling to see how this is superior / novel to existing approaches in that space. It's not that novel, see the planning in LLM work that I credit in the README.md. This transfers from a setting of checks by tests to a setting of checks by a verifier.
The code is very simple, and it actually brings open models to reliably solve problems that even ChatGPT only sometimes solve.
I think it's a good foundation to build more sophisticated verification with LLMs. |