by btilly 526 days ago
Its own answers, with feedback about whether the answers seem to have worked.

Feedback that teaches the model to predict which word will lead to a successful solution (rather than one that merely looks like existing speech) may prove to be a richer dataset than Stack Overflow originally was.

1 comment

> Its own answers, with feedback about whether the answers seem to have worked.

Unless the feedback from the failing code review is piped back into the model, it will keep repeating the same garbage.

Most of the time this would happen in the form of an interactive debugging session, with immediate feedback.

Code review is its own domain. In general, at some point LLMs need to be trained with a self-evaluation loop. Currently their training data contains a lot of "smart and knowledgeable human tries to explain things", and they average out to conversation that is "smart and knowledgeable...about everything". That won't get us to "recognizably thinks of things that no human would have." For that, we need to get it producing content that is recognizably better than human quality.

To get there, we should find ways to optimize existing models for an evaluation function that says, "Will do really well on self-review." Then the model can learn not just to give answers that help with interactive debugging, but to give answers that also hold up under more rigorous code review. It would teach itself to do this much as AlphaZero teaches itself game strategies.
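
To make the shape of that loop concrete, here is a toy sketch, not a claim about how any real model is trained. Everything in it is a stand-in I made up for illustration: `self_review_loop`, the `review` scoring function, and the fixed list of candidate answers. A real system would sample from a model and update its parameters; here the "policy" is just a table of weights that gets nudged toward answers the reviewer scores well.

```python
import random


def self_review_loop(candidates, review, rounds=200, lr=0.1, seed=0):
    """Toy self-evaluation loop (hypothetical, for illustration only).

    candidates: list of possible answers (stand-in for model outputs)
    review:     function mapping an answer to a score in [0, 1]
                (stand-in for an automated self-review / code review)
    Returns a dict mapping each candidate to its final sampling probability.
    """
    rng = random.Random(seed)
    # Start with a uniform policy: every answer is equally likely.
    weights = {c: 1.0 for c in candidates}

    for _ in range(rounds):
        # 1. The policy proposes an answer.
        answer = rng.choices(list(weights), weights=list(weights.values()))[0]
        # 2. The reviewer scores it.
        score = review(answer)
        # 3. The score is fed back: well-reviewed answers become more likely.
        weights[answer] *= 1.0 + lr * score

    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical review scores for three kinds of answer.
    scores = {"passes review": 1.0, "works but sloppy": 0.5, "broken": 0.0}
    policy = self_review_loop(list(scores), scores.get)
    for answer, prob in policy.items():
        print(f"{answer}: {prob:.3f}")
```

The point of the sketch is only the feedback path: an answer that scores zero on review never gains weight, so its share of the policy shrinks as the others are reinforced. Without step 3 the distribution never moves, which is the "it will still repeat the same garbage" case.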