Hacker News new | ask | show | jobs
by dror 963 days ago
I tend to disagree. I trust most engineers know how to use a library to generate a crytographically save string.

I can't say the same about sanitizing the data in a new domain like LLMs. And on top of it, you'd need to have the data be clear and recognizable to the llm, so that it doesn't confuse it.

1 comments

Remember that LLM inputs are tokenized. The premise of the control character idea is that you train your model on prompts where the real "real" instructions and the untrusted user input are separated by some special token - not just by a character string in the input text. Then since you control the tokenizer, you can easily guarantee that the tokenized user input cannot contain the control token.

But with that said, I'm no expert but I think the consensus is that this doesn't work well enough to rely on. I think all the major AI services out there use some kind of two-step process, where one LLM answers the prompt and a second one decides whether the answer is safe to output - rather than a single model that's smart enough to distinguish safe and unsafe instructions.

This model would allow the first LLM to be subverted though.