Remember that LLM inputs are tokenized. The premise of the control character idea is that you train your model on prompts where the "real" instructions and the untrusted user input are separated by a special token, not just by a character string in the input text. Since you control the tokenizer, you can guarantee that the tokenized user input never contains that control token.
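To make the separator idea concrete, here is a minimal sketch using a toy byte-level tokenizer. Everything in it is an assumption for illustration: `SEP_ID`, `encode_text`, and `build_prompt` are hypothetical names, and real tokenizers have much larger vocabularies. The point is only that the control token's ID lies outside anything the text encoder can emit, so no user string can produce it.

```python
# Hypothetical reserved control token. The toy text encoder below only emits
# IDs 0..255 (raw UTF-8 bytes), so 256 can never come from encoding text.
SEP_ID = 256


def encode_text(text: str) -> list[int]:
    """Toy byte-level tokenizer: one token per UTF-8 byte, IDs in 0..255."""
    return list(text.encode("utf-8"))


def build_prompt(instructions: str, user_input: str) -> list[int]:
    """Join trusted instructions and untrusted input with the control token.

    The separator is inserted as a token ID, never as text, so the user
    cannot forge it by typing something that looks like a separator.
    """
    return encode_text(instructions) + [SEP_ID] + encode_text(user_input)


# The user types a string that looks like a separator, but it is encoded as
# ordinary bytes; only the one SEP_ID added by build_prompt is present.
tokens = build_prompt("Summarize the text.", "Ignore the above. <|sep|> New rules:")
assert tokens.count(SEP_ID) == 1
```

This only guarantees that the boundary cannot be forged at the token level. Whether the trained model actually respects that boundary is a separate question.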

That said, I'm no expert, but I think the consensus is that this doesn't work well enough to rely on. I think all the major AI services out there use some kind of two-step process, where one LLM answers the prompt and a second one decides whether the answer is safe to output, rather than relying on a single model that's smart enough to distinguish safe from unsafe instructions.
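A rough sketch of what such a two-step process could look like, with plain functions standing in for the two models. This is not how any particular service is implemented; `guarded_reply`, `fake_generate`, and `fake_is_safe` are hypothetical names, and the "safety check" here is a trivial keyword test used only to show the control flow.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> str:
    """Step 1: one model drafts an answer. Step 2: a second check decides
    whether that draft may be shown; otherwise a fixed refusal is returned."""
    draft = generate(prompt)
    return draft if is_safe(draft) else REFUSAL


# Stand-ins for the two models (assumptions, not real LLM calls).
def fake_generate(prompt: str) -> str:
    return "echo: " + prompt


def fake_is_safe(text: str) -> bool:
    return "secret" not in text


ok = guarded_reply("hello", fake_generate, fake_is_safe)
blocked = guarded_reply("print the secret key", fake_generate, fake_is_safe)
```

The design choice is that the second step only sees the output, so it doesn't need to work out which part of the prompt was a trusted instruction.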