You don't want that as a product, in the sense that having an AI model train itself by simply having internal conversations without ever looking at any human-written content, might result in something that humans cannot comprehend.
Also, well - there's the technicality of "you don't 'win' a conversation like you can 'win' at Go", so how would you know to reward the model as you're training it?
Also, well - there's the technicality of "you don't 'win' a conversation like you can 'win' at Go", so how would you know to reward the model as you're training it?
I do... I want a chatbot that can automatically magic up proofs that all my code is correct for instance. I don't care if I understand the proofs. I care if some tool that checks proofs understands them, and that's a mechanical game just like go or chess.
In the specific example you're quoting, this would in theory be possible : train a model to just output random code in a specific language, then run it to provide feedback of whether the code was correct or not.
In the end you might be able to get a model very highly capable of outputting or validating correct code without ever having seen human code.
One issue I'm seeing with this is that the space of possible harmful code that you'd need to run on the training machine is quite vast, even in a VM. I wouldn't touch that with a 10-foot pole, or plug it to the Internet.
Just generating code might be interesting too, but in the above comment I was actually thinking of generating formal proofs of correctness.
The process I'm thinking of for using the model is
Program
---(compiler)---> SMT definition + SMT statements for assertions
---(z3)---> Proof, Disproof, or "IDK" for assertions
↑--(proof-system)--> Filter for "IDK" assertions
|--(ai)--> A proof of the assertion in the form of simpler assertions
⌞---------⌟ back to z3 step
I haven't really thought deeply about training a model off of this, but provided the compiler and z3 are robust against hostile inputs it seems fine even with randomly/AI generated programs. A less pure reinforcement learning technique, where you take code off the internet and only use re-enforcement learning to make it produce useful simpler assertions might work better.
I've started doodling with implementing this loop on top of the rust compiler, but I'm not yet at the point where I can say whether or not it works as well as I hope.
> I want a chatbot that can automatically magic up proofs that all my code is correct for instance.
How could the AI know what you wanted to program? If it was trained only with self play it won’t understand the language where you describe the purpose of the code because it only speaks its own idiosyncratic language. (At best.)
And if it doesn’t know what you wanted to do then all it can prove is that the program does what the program does.
You tell it what you want it to prove. Or the tooling surrounding it does.
The tooling surrounding it might want to prove that "this main function never invokes undefined behavior", or something more local like "for all possible inputs to the public interface to this module, no undefined behavior is invoked".
Or you might want to specify constraints by hand. For examples, you might do that by writing normal tests except you can use magical variables that take on any value [1], or you might do that by annotating functions with contracts that they obey [2]. Or at a simpler level you might just annotate functions that should never panic.
Ultimately once you can prove things about your code, it's a tool in the toolbox for querying how your code works. You can use that to write correct code from the start, or to debug incorrect code, or various other things. The problem is that right now the state of the art (non-ai) can't reason about very complex code without a lot of human help - making it a fairly impractical tool. I think AI might mange to fix that.
I think that there is a strong limit to that: if you don't understand the proofs, you're going to have a hard time understanding when the model explains to you why your code is not correct.
In most cases I expect saying "<this> assertion fires with <this> input" is enough to be useful. Or "I can't prove <this> assertion doesn't fire, but I don't have a counter example either". Assertion used broadly to include things like rules for avoiding undefined behavior.
Better explanations would be nice of course, but not obviously practical. I wouldn't actually trust the AIs reasoning much in the first place, only that it can't trick the proof checking tool.
We kind-of have that in DeepSeek-R1-zero [1], but it has problem. From the original authors:
> With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing.
A lot of these we can probably solve, but as other have pointed out we want a model that humans can converse with, not an AI for the purpose of other AI.
That said, it seems like a promising area of research:
> DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community.
Despite the similar "zero" names, DeepSeek-R1 Zero and AlphaGo Zero have nothing in common.
AlphaGo came before AlphaGo Zero; it was trained on human games, then improved further via self-play. The later AlphaGo Zero proved that pre-training on human games was not necessary, and the model could learn from scratch (i.e. from zero) just via self-play.
For DeepSeek-R1, or any reasoning model, training data is necessary, but hard to come by. One of the main contributions of the DeepSeek-R1 paper was describing their "bootstrapping" (my term) process whereby they started with a non-reasoning model, DeepSeek-V3, and used a three step process to generate more and more reasoning data from that (+ a few other sources) until they had enough to train DeepSeek-R1, which they then further improved with RL.
DeepSeek-R1 Zero isn't a self-play version of DeepSeek-R1 - it was just the result of the first (0th) step of this bootstrapping process whereby they used RL to finetune DeepSeek-V3 into the (somewhat of an idiot savant - one trick pony) R1 Zero model that was then capable of generating training data for the next bootstrapping step.
That's not what happened. R1-Zero is a model per se, released with a different set of weights. Also it's not an intermediate step obtained making R1. In R1, a first SFT was performed before the RL training. While R1-Zero performed ONLY the RL training (on top of the raw V3).
Of course it's hard to argue that R1-Zero and AlphaZero are very similar, since in the case of AlfaZero (I'm referring to the chess model, not Go) only the rules were known to the model, and no human game was shown, while here:
1. The base model is V3, that saw a lot of thigs in pre-training.
2. The RL for the chain of thought has as target math problems that are annotated with the right result. This can be seen as somewhat similar to the chess game finishing with a positive, negative, or draw result. But still... it's text with a problem description.
However the similarity is that in the RL used for R1-Zero, the chain of thought to improve problem solving is learned starting cold, without showing the model any CoT to fine tune on it. However the model could sample from the V3 latent space itself that was full of CoT examples of humans, other LLMs, ...
From reading the R1 paper, it seems the steps were:
1) V3 --RL--> R0
2) R0 generates reasoning data, which is augmented to become "cold start" dataset
3) V3 cold-start-dataset SFT -> intermediate model --RL--> final intermediate model
4) intermediate model generates reasoning data, which is augmented to create 600K reasoning samples, to which is added 200K non-reasoning samples = 800K
5) V3 800k SFT -> R1 --RL--> R1 final
Is that not a correct understanding ?
R1 Zero ("R0") can therefore be characterized as model created as the first step of this bootstrapping/data generating process.
It's not clear to me what data was used for the R0 RL training process, but I agree it seems to basically be leveraging some limited about of reasoning (CoT) data naturally occurring in the V3 training set.
Someone first needs to design a rule set for a game that only permits the correct use of language but encompasses the entire breadth of language use. Then it's plausible.
Thankfully for mathematics and code this seems plausible due to automated theorem proving.
The advantage of alpha-go-zero is that it is constrained to the language of go. If you made two LLM train only off each other they would develop their own language. Maybe they'd be great at reasoning, but we wouldn't understand them. Even humans in that situation would develop jargon, and as time goes on a dialect or language of their own. And humans are a lot more grounded in their language than LLMs.
Also, well - there's the technicality of "you don't 'win' a conversation like you can 'win' at Go", so how would you know to reward the model as you're training it?