Hacker News new | ask | show | jobs
by HanClinto 705 days ago
This is a really fascinating approach, and I appreciate you sharing your structure and thinking behind this!

I hope this isn't too much of a tangent, but I've been working on building something lately, and you've given me some inspiration and ideas on how your approach could apply to something else.

Lately I've been very interested in using adversarial game-playing as a way for LLMs to train themselves without RLHF. There have been some interesting papers on the subject [1], and initial results are promising.

I've been working on extending this work, but I'm still just in the planning stage.

The gist of the challenge involves setting up 2+ LLM agents in an adversarial relationship, and using well-defined game rules to award points to either the attacker or to the defender. This is then used in an RL setup to train the LLM. This has many advantages over RLHF -- in particular, one does not have to train a discriminator, and neither does it rely on large quantities of human-annotated data.

With that as background, I really like your structure in AI Alibis, because it inspired me to solidify the rules for one of the adversarial games that I want to build that is modeled after the Gandalf AI jailbreaking game. [2]

In that game, the AI is instructed to not reveal a piece of secret information, but in an RL context, I imagine that the optimal strategy (as a Defender) is to simply never answer anything. If you never answer, then you can never lose.

But if we give the Defender three words -- two marked as Open Information, and only one marked as Hidden Information, then we can penalize the Defender for not replying with the free information (much like your NPCs are instructed to share information that they have about their fellow NPCs), and they are discouraged for sharing the hidden information (much like your NPCs have a secret that they don't want anyone else to know, but it can perhaps be coaxed out of them if one is clever enough).

In that way, this Adversarial Gandalf game is almost like a two-player version of your larger AI Alibis game, and I thank you for your inspiration! :)

[1] https://github.com/Linear95/SPAG [2] https://github.com/HanClinto/MENTAT/blob/main/README.md#gand...

2 comments

Thanks for sharing! I read your README and think it's a very interesting research path to consider. I wonder if such an adversarial game approach could be outfitted to not just well-defined games but to wholly generalizable improvements. e.g., could be used as a way to improve RLAIF potentially?
Thanks for the feedback!

> I wonder if such an adversarial game approach could be outfitted to not just well-defined games but to wholly generalizable improvements. e.g., could be used as a way to improve RLAIF potentially?

That's a good question!

Here's my (amateur) understanding of the landscape:

- RLHF: Given a mixture of unlabeled LLM responses, first gather human feedback on which response is preferred to mark them as Good or Bad. Use these annotations to train a Reward Model that attempts to model the preferences of humans on the input data. Then use this Reward Model for training the model with traditional RL techniques.

- RLAIF: Given good and bad examples of LLM responses, instead of using human feedback, use an off-the-shelf zero-shot LLM to annotate the data. Then, one can either train a traditional Reward Model using these auto-annotated samples, or else one can use the LLM to generate scores in real-time when training the models (a more "online" method of real-time scoring). In either case, each of these Reward methods can be used for training with RL.

- Adversarial Games: By limiting the scope of responses to situations where the preference of one answer vs. another can be computed with an algorithm (I.E., clearly-defined rules of a game), then we bypass the need to deal with a "fuzzy" Reward Model (whether built through traditional RLHF, or through RLAIF). The whole reason why RLAIF is a "thing" is because high-quality human-annotated data is difficult to acquire, so researchers attempt to approximate it with LLMs. But if we bypass that need and can clearly define the rules of a game, then we basically have an infinite source of high-quality annotated data -- although limited in scope to apply only to the context of a game.

If the rules of the game exist only within the boundaries of the game (such as Chess, or Go, or Starcraft), then the things learned may not generalize well outside of the game. But the expectation is that -- if the context of the game goes through the semantic language space (or through "coding space", in the context of training coding models) -- then the things that the LLM learns within the game will have general applicability in the general space.

So if I can understand your suggestion, to make a similar RLAIF-type improvement to adversarial training, then instead of using a clearly-defined game structure to define the game space, then we would use another LLM to act as the "arbiter" of the game -- perhaps by first defining the rules of a challenge, and then judging between the two competitors which response is better.

Instead of needing to write code to say "Player A wins" or "Player B wins", using an LLM to determine that would shortcut that.

That's an interesting idea, and I need to mull it over. My first thought is that -- I was trying to get away from "fuzzy" reward models and instead use something that is deterministically "perfect". But maybe the advantage of being able to move more quickly (and explore more complex game spaces) would outweigh it.

I need to think this through. There are some situations where I could really see your generalized approach working quite well (such as the proposed "Adversarial Gandalf" game -- using an LLM as the arbiter would probably work quite well), but there are others where using an outside tool (such as a compiler, in the case of the code-vulnerability challenges) would still be necessary.

I wasn't aware of the RLAIF paper before -- thank you for the link! You've given me a lot to think about, and I really appreciate the dialog!

Adversarial game playing as a way of training AI is basically the plot of War Games.
And also the breakthrough that let AlphaGo and AlphaStar make the leaps that they did.

The trouble is that those board games don't translate well to other domains. But if the game space can operate through the realm of language and semantics, then the hope is that we can tap into the adversarial growth curve, but for LLMs.

Up until now, everything that we've done has just been imitation learning (even RLHF is only a poor approximation "true" RL).