|
|
|
|
|
by pron
95 days ago
|
|
First, it's not a question of decidability but of tractability. Verifying programs in a language that has nothing but boolean variables, no subroutines, and loops at depth of at most 2 - far, far, from Turing-completeness - is already intractable (reduction from TQBF). Second, it's very easy to have some specs decided tractably, at least in many practical instances, but they are far too weak to specify most correctness properties programs need. You mentioned the Rust type system, and it cannot specify properties with interleaved quantifiers, which most interesting properties require. And as for HoTT - or any of the many equivalent rich formalisms - checking their proofs is tractable, but not finding them. The intractability of verification of even very limited languages (again TQBF) holds regardless of how the verification is done. I think it's best to take it step by step, and CodeSpeak's approach is pragmatic. |
|
> First, it's not a question of decidability but of tractability
The question of decidability is a form of many-to-one, reduction. In fact RE-complete is defined by many-to-one reductions.
In a computational complexity sense, tractability is a far stronger notion. Basically an algorithm efficient if its time complexity at most PTIME for any size n input. A problem is "tractable" if there is an efficient algorithm that solves it.
You are correct, if you limit your expressiveness to PTIME, where because P == co-P, PEM/tight apartness/Omniscience principles hold.
But the problem is that Church–Rosser Property[0] (proofs ~= programs) and Brouwer–Heyting–Kolmogorov Interpretation[1] (Propositions at types) are NOT binary SAT, and you have concepts like mere propositions[3] that are very different than just BSAT.
But CodeSpeak doesn't have formal specifications, so this is irrelevant. Their example code output produced code with path traversal/resource exhaustion risks and correctness issues and is an example.
My personal opinion is that we will need to work within the limitations of the systems, and while it is trivial to come up with your own canary, I would recommend playing with [3] before the models directly target it.
Generating new code from a changed spec will be less difficult, specifically when the mess of real world specs comes into play. You can play with the example on CodeSpeak's front page, trying to close the various holes the software has with malformed/malicious input, giving the LLM the existing code base and you will see that "brown m&m"[3] problem arise quickly. At least for me if I prompt it to look at the changed natural language spec, generating new code it was more successful.
But for some models like the qwen3 coder next, the style resulted in far less happy path protections, which that model seems to have been trained on to deliver by default in some cases.
[0] https://calhoun.nps.edu/entities/publication/015f1bab-6642-4... [1] https://www.cs.cornell.edu/courses/cs6110/2017sp/lectures/le... [2] https://www.cambridge.org/core/journals/journal-of-functiona... [3] https://codemanship.wordpress.com/2025/10/03/llms-context-wi...