| This is a really fascinating approach, and I appreciate you sharing your structure and thinking behind this! I hope this isn't too much of a tangent, but I've been working on building something lately, and you've given me some inspiration and ideas on how your approach could apply to something else. Lately I've been very interested in using adversarial game-playing as a way for LLMs to train themselves without RLHF. There have been some interesting papers on the subject [1], and initial results are promising. I've been working on extending this work, but I'm still just in the planning stage. The gist of the challenge involves setting up 2+ LLM agents in an adversarial relationship, and using well-defined game rules to award points to either the attacker or to the defender. This is then used in an RL setup to train the LLM. This has many advantages over RLHF -- in particular, one does not have to train a discriminator, and neither does it rely on large quantities of human-annotated data. With that as background, I really like your structure in AI Alibis, because it inspired me to solidify the rules for one of the adversarial games that I want to build that is modeled after the Gandalf AI jailbreaking game. [2] In that game, the AI is instructed to not reveal a piece of secret information, but in an RL context, I imagine that the optimal strategy (as a Defender) is to simply never answer anything. If you never answer, then you can never lose. But if we give the Defender three words -- two marked as Open Information, and only one marked as Hidden Information, then we can penalize the Defender for not replying with the free information (much like your NPCs are instructed to share information that they have about their fellow NPCs), and they are discouraged for sharing the hidden information (much like your NPCs have a secret that they don't want anyone else to know, but it can perhaps be coaxed out of them if one is clever enough). In that way, this Adversarial Gandalf game is almost like a two-player version of your larger AI Alibis game, and I thank you for your inspiration! :) [1] https://github.com/Linear95/SPAG
[2] https://github.com/HanClinto/MENTAT/blob/main/README.md#gand... |