by senko 1179 days ago
The thing with AlphaGo Zero is that there is a clear external arbiter of which side of the internal debate wins, so the algorithm can learn.

For an LLM to use that technique on the kind of reasoning you're talking about, you need a human in the loop to explain why it's wrong or right; otherwise it just hallucinates random stuff.

That's basically what RLHF[0] is, which was used to great success in training ChatGPT.

[0] https://huggingface.co/blog/rlhf

1 comment

Thanks! The interesting thing is that, from my casual observations, GPT itself might already be good enough to arbitrate its own output, just as a human writer can improve their own writing by iterating on it. In a sense, having humans in the loop was what it took to reach the point where self-arbitration becomes possible.
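
To make the iterate-on-your-own-draft idea concrete, here is a minimal sketch of such a loop. Everything in it is an assumption for illustration: `generate` and `critique` are hypothetical callables standing in for calls to the same model, and the score threshold is arbitrary. It is not how ChatGPT or RLHF actually works, just the shape of the loop.

```python
from typing import Callable, Tuple


def self_refine(
    prompt: str,
    generate: Callable[[str], str],                      # hypothetical: model produces text
    critique: Callable[[str, str], Tuple[float, str]],   # hypothetical: model scores its own draft
    max_rounds: int = 3,
    good_enough: float = 0.9,                            # arbitrary threshold, an assumption
) -> str:
    """Draft, self-critique, and revise until the model's own score is high enough."""
    draft = generate(prompt)
    for _ in range(max_rounds):
        # The model acts as its own arbiter: no human, no external judge.
        score, feedback = critique(prompt, draft)
        if score >= good_enough:
            break
        # Feed the critique back in and ask for a revision.
        draft = generate(
            f"{prompt}\n\nPrevious draft:\n{draft}\n\nFeedback:\n{feedback}\n\nRevise."
        )
    return draft
```

The catch, as the parent comment points out, is that nothing outside the model checks whether `critique` is right, so the loop is only as good as the model's own judgment.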