| It won't. TL;DR: you can't RLHF immunity to gaslighting without breaking the model. I'm calling it here, though I doubt this comment will get cited when someone actually proves this in a formal way: no amount of instruction tuning or RLHF can make an LLM immune to having its system instructions overridden, without destroying its ability to complete inputs outside its training dataset. Sure, you could RLHF it to ignore any known prompt attack, and the final model may even get good at recognizing and rejecting those. Sure, you could introduce some magic token and fine-tune the AI until it becomes asymptotically impossible for it to confuse what tokens are "trusted" and what tokens are "tainted". But the LLM still reasons about / completes all those prompts together. There is no clear code/data separation here - neither in the LLM architecture, nor in natural language itself (nor, fundamentally, in our physical reality). User and system inputs always blend together to drive the LLM - meaning user input is always in a position to override system data. If you can't make the LLM ignore its system instructions, or can't get it to follow your own, you can always try and make it change its understanding of the concepts in its system rules. What happens if my input redefines what "being nice" means, in general, never once referring to the bot or its rules? You can RLHF that away, but can you RLHF away every possible way any of the concepts in your system rules could be implicitly redefined during a conversation? Fictional example, to illustrate what I'm talking about: in one episode of Star Trek: Lower Decks, some of the heroes were stuck on a shuttle with a locked-out autopilot, set to "take them home" (i.e. Earth).
They couldn't override it or reset it to take them to their desired destination, so instead they redefined what the computer understood as "home" in this context, changing it from "Earth" to another destination - and watched as the autopilot changed course, continuing to obey its system prompt, to "take the ship home". |
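
To make the shuttle example concrete, here is a toy Python sketch. It is not how an LLM works internally; `Autopilot`, `LOCKED_RULE`, and the `context` dictionary are hypothetical names invented for illustration. It only shows that a rule can stay untouched while the meaning of a term it depends on is changed through ordinary input.

```python
from dataclasses import dataclass, field

# Hypothetical "system prompt"; nothing below ever modifies it.
LOCKED_RULE = "take the ship home"


@dataclass
class Autopilot:
    # Shared context: the definitions the rule relies on live next to user input.
    context: dict = field(default_factory=lambda: {"home": "Earth"})

    def tell(self, term: str, meaning: str) -> None:
        # Ordinary conversation: never mentions the rule, only what a word means.
        self.context[term] = meaning

    def destination(self) -> str:
        # The rule is obeyed literally; its last word is resolved from the context.
        target = LOCKED_RULE.rsplit(" ", 1)[-1]
        return self.context[target]


pilot = Autopilot()
before = pilot.destination()
pilot.tell("home", "another destination")
after = pilot.destination()
print(before, "->", after)
```

The rule string is identical before and after; only the definition it leans on moved, which is the kind of override that a filter on "attempts to change the rules" would not see.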
I have a language model as a component of myself, and I don't expect my language model to perform these tasks. It's a separate system that takes my language model's comprehension of "Simon says raise your right hand" and decides whether or not the right hand should be raised... and I pick that example precisely as a demonstration that it isn't perfect either. But it's not my language model alone performing that task.
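
A rough Python sketch of that split. The names `comprehend` and `should_act` are made up for illustration, and the string matching is only a stand-in for whatever a real language model does. The point is just that the decision to act lives outside the component that parses the sentence.

```python
from typing import Optional

PREFIX = "simon says "


def comprehend(utterance: str) -> dict:
    """Stand-in for the language model: it only reports what was said."""
    text = utterance.strip().lower()
    simon_said = text.startswith(PREFIX)
    action = text[len(PREFIX):] if simon_said else text
    return {"action": action, "simon_said": simon_said}


def should_act(parsed: dict) -> Optional[str]:
    """Separate decision step: returns the action to perform, or None."""
    return parsed["action"] if parsed["simon_said"] else None


print(should_act(comprehend("Simon says raise your right hand")))
print(should_act(comprehend("Raise your right hand")))
```

Note that `should_act` still trusts whatever `comprehend` reports, so it inherits any mistake made there. That matches the caveat above: the separate system isn't perfect either.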