Hacker News new | ask | show | jobs
by hunter2_ 962 days ago
The escape string doesn't need to be hard to guess, it can be as simple as a single character. The user interface (or whatever source of untrusted data) sanitizes that particular character before handing it off to the sensitive function, either by dropping it or escaping it such that it doesn't signal the end of untrusted data.
1 comments

I tend to disagree. I trust most engineers know how to use a library to generate a crytographically save string.

I can't say the same about sanitizing the data in a new domain like LLMs. And on top of it, you'd need to have the data be clear and recognizable to the llm, so that it doesn't confuse it.

Remember that LLM inputs are tokenized. The premise of the control character idea is that you train your model on prompts where the real "real" instructions and the untrusted user input are separated by some special token - not just by a character string in the input text. Then since you control the tokenizer, you can easily guarantee that the tokenized user input cannot contain the control token.

But with that said, I'm no expert but I think the consensus is that this doesn't work well enough to rely on. I think all the major AI services out there use some kind of two-step process, where one LLM answers the prompt and a second one decides whether the answer is safe to output - rather than a single model that's smart enough to distinguish safe and unsafe instructions.

This model would allow the first LLM to be subverted though.