| HN Mirror

Thanks for the feedback!

> I wonder if such an adversarial game approach could be outfitted to not just well-defined games but to wholly generalizable improvements. e.g., could be used as a way to improve RLAIF potentially?

That's a good question!

Here's my (amateur) understanding of the landscape:

- RLHF: Given a mixture of unlabeled LLM responses, first gather human feedback on which response is preferred to mark them as Good or Bad. Use these annotations to train a Reward Model that attempts to model the preferences of humans on the input data. Then use this Reward Model for training the model with traditional RL techniques.

- RLAIF: Given good and bad examples of LLM responses, instead of using human feedback, use an off-the-shelf zero-shot LLM to annotate the data. Then, one can either train a traditional Reward Model using these auto-annotated samples, or else one can use the LLM to generate scores in real-time when training the models (a more "online" method of real-time scoring). In either case, each of these Reward methods can be used for training with RL.

- Adversarial Games: By limiting the scope of responses to situations where the preference of one answer vs. another can be computed with an algorithm (I.E., clearly-defined rules of a game), then we bypass the need to deal with a "fuzzy" Reward Model (whether built through traditional RLHF, or through RLAIF). The whole reason why RLAIF is a "thing" is because high-quality human-annotated data is difficult to acquire, so researchers attempt to approximate it with LLMs. But if we bypass that need and can clearly define the rules of a game, then we basically have an infinite source of high-quality annotated data -- although limited in scope to apply only to the context of a game.

If the rules of the game exist only within the boundaries of the game (such as Chess, or Go, or Starcraft), then the things learned may not generalize well outside of the game. But the expectation is that -- if the context of the game goes through the semantic language space (or through "coding space", in the context of training coding models) -- then the things that the LLM learns within the game will have general applicability in the general space.

So if I can understand your suggestion, to make a similar RLAIF-type improvement to adversarial training, then instead of using a clearly-defined game structure to define the game space, then we would use another LLM to act as the "arbiter" of the game -- perhaps by first defining the rules of a challenge, and then judging between the two competitors which response is better.

Instead of needing to write code to say "Player A wins" or "Player B wins", using an LLM to determine that would shortcut that.

That's an interesting idea, and I need to mull it over. My first thought is that -- I was trying to get away from "fuzzy" reward models and instead use something that is deterministically "perfect". But maybe the advantage of being able to move more quickly (and explore more complex game spaces) would outweigh it.

I need to think this through. There are some situations where I could really see your generalized approach working quite well (such as the proposed "Adversarial Gandalf" game -- using an LLM as the arbiter would probably work quite well), but there are others where using an outside tool (such as a compiler, in the case of the code-vulnerability challenges) would still be necessary.

I wasn't aware of the RLAIF paper before -- thank you for the link! You've given me a lot to think about, and I really appreciate the dialog!