I'm not sure how consensus would get you significantly above human baseline. Doesn't that just get you some sort of average?

The basic problem with synthetic self-training is that we need some reward function that tells us whether a given synthetic example is good. In the case of AlphaGo Zero, the signal was whether a synthetic strategy won the game or scored well, which can be detected automatically. But how do we automatically recognize that synthetic text has "high quality"?

One case where it might work is proofs written in a formal proof language, which can be checked automatically by software. So if a language model is tasked with generating synthetic conjecture/proof pairs, it is possible to automatically recognize the correct ones and use them as self-training data (unsupervised, supervised, reinforcement, I'm not sure which), enabling it to recursively create more complex synthetic proofs.
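To make "checked automatically" concrete, here is a tiny illustration in Lean 4. The theorem names are made up for this example; the point is only that the checker either accepts a conjecture/proof pair or rejects it, with no human judgment involved.

```lean
-- A conjecture/proof pair the checker accepts: n + 0 reduces to n by definition.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- Another accepted pair, proved by citing a lemma from Lean's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

A generated pair like "n + 1 = n, by rfl" would fail to type-check, so it could be filtered out of the training data automatically.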

A very similar approach (with some sort of unit tests instead of proofs) is described in more detail here: https://arxiv.org/abs/2207.14502 It's been a while since I read it, so my description above is kinda fuzzy. It might involve some adversarial step that I missed.
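A rough sketch of what such a filter could look like, assuming each synthetic example is a puzzle function `f` plus a candidate answer, and that "correct" just means `f(answer)` returns True. This is my own toy version, not the paper's code: `verify`, `filter_verified`, and the hard-coded candidates (standing in for model output) are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical stand-in for model output: (puzzle source, candidate answer source).
Candidate = Tuple[str, str]


def verify(puzzle_src: str, answer_src: str) -> bool:
    """Return True only if the puzzle function `f` accepts the answer.

    Assumption: the puzzle source defines a function named `f`.
    WARNING: exec/eval on generated code is unsafe outside a sandbox.
    """
    env: dict = {}
    try:
        exec(puzzle_src, env)            # defines f in env
        answer = eval(answer_src, env)   # evaluates the candidate answer
        return env["f"](answer) is True
    except Exception:
        # Crashing or malformed examples count as failures.
        return False


def filter_verified(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the pairs that pass the automatic check."""
    return [c for c in candidates if verify(*c)]


# Toy candidates in place of real model samples.
candidates: List[Candidate] = [
    ("def f(x): return x * x == 49 and x > 0", "7"),    # correct
    ("def f(x): return x * x == 49 and x > 0", "-7"),   # wrong answer
    ("def f(s): return s[::-1] == 'olleh'", "'hello'"),  # correct
    ("def f(x): return x / 0 == 1", "3"),               # puzzle crashes
]

verified = filter_verified(candidates)
```

Only the verified pairs would go back into the training set, so the check plays the role of the reward signal.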

One problem is getting this process off the ground (bootstrapping). That is difficult because we need some baseline capability before the model can create any successful synthetic examples, and there aren't a lot of human-created formal proofs that could serve as bootstrapping training data.

Another problem is that, even if it worked, this system would just be good at generating proofs. Maybe there is some amount of transfer to natural language intelligence, but I'm not sure about that.

If you have a different idea for creating a reward signal, I would be interested to hear how it could be done.