Hacker News new | ask | show | jobs
by GistNoesis 696 days ago
Once you view prolog goal reaching as a game. You can apply Reinforcement Learning methodologies. The goal is writing a valid proof, aka a sequence of picking valid rules and variables assignment.

Value being estimated : The expected discounted reward of reaching the goal. The shorted the proof the better.

The sufficient statistic : The embedding representation of the current solving state (the inner state of your LLM (or any other model) that you use to make your choices). You make sure it's sufficient by being able to regenerate the state from the representation (using an auto-encoder or vae does the trick). You build this statistic across various instances of problems. This tells you what is a judicious choice of variable based on experience. Similar problems yield similar choices.

The crude estimator : All choices of have the same value therefore a random choice, The improved estimator : The choice value is conditioned on the current embedding representation of the state using a Neural Network.

You can apply Rao-Blackwell once again, based by also conditioning one-step look-ahead. (Or at the limit applying it infinitely many times by solving the bellman equation.)

(You can alternatively view each update step of your model, as an application of Rao-Blackwell theorem on your previous estimator. You have to make sure though that there is no mode collapse.)

You don't have to do it explicitly, it happens under the hood by your choice of modelisation in how to pick the decision.