Hacker News
by bambax 943 days ago
> 4) they (current LLMs) cannot backtrack when they find that what they already wrote turned out not to lead to a solution, and it is too expensive to give them the thousands of restarts they'd require to randomly guess their way through the problem if you did give them that facility

This sounds like a job for a reward function? If correctly implemented, couldn't it enable an LLM to self-learn?

1 comment

That is specifically what deep Q-learning (as in Q*?) does.
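
For anyone unfamiliar with the idea, here is a minimal sketch of plain tabular Q-learning on a toy problem. It is not deep Q-learning (no neural network), and it says nothing about what Q* actually is. The environment, `N_STEPS`, `step`, `train`, and `greedy_policy` are all hypothetical names made up for illustration: at each of four steps the agent either makes progress (action 1) or hits a dead end (action 0), and the reward signal is what teaches it to avoid dead ends without being told which action is correct.

```python
import random

N_STEPS = 4          # length of the toy "solution" (hypothetical)
ACTIONS = (0, 1)     # 1 = the step that makes progress, 0 = a dead end


def step(state, action):
    """Toy environment (an assumption, not from the thread).

    Returns (next_state, reward, done).
    """
    if action == 0:                  # dead end: episode ends with a penalty
        return state, -1.0, True
    if state + 1 == N_STEPS:         # last correct step: problem solved
        return state + 1, 1.0, True
    return state + 1, 0.0, False     # progress, but no reward yet


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STEPS)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:          # explore
                action = rng.choice(ACTIONS)
            else:                               # exploit current estimates
                action = max(ACTIONS, key=lambda a: q[state][a])
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """Best action per state according to the learned table."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STEPS)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

The penalty for a dead end and the reward for finishing propagate backwards through the table, so after enough episodes the greedy policy prefers the progress step in every state. Whether this scales to the LLM backtracking problem in the quote is exactly the open question.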