by senko
1179 days ago
The thing with AlphaGo Zero is that there is a clear external arbiter of which side of the internal debate wins, so the algorithm can learn. For an LLM to use the technique on the kind of reasoning you're talking about, you need a human in the loop to explain to it why it's wrong or right; otherwise it just hallucinates random stuff. That's basically what RLHF[0] is, which was used to great success in training ChatGPT.

[0] https://huggingface.co/blog/rlhf