Hacker News new | ask | show | jobs
by xg15 1179 days ago
Ok, the "repeat this in your internal voice" exploit is impressive.

However, apart from this I don't see anything concrete that ChatML uses different parts of the network for different input sources. The source is prefixed, but it doesn't seem to say anything about how the source parameter is processed.

Also, with all due respect, but your finding that ChatML does not work seems to be mainly this:

>> Note that ChatML makes explicit to the model the source of each piece of text, and particularly shows the boundary between human and AI text. This gives an _opportunity_ to mitigate and _eventually_ solve injections, as the model can tell which instructions come from the developer, the user, or its own input.

> Emphasis mine. To summarize, they are saying injections aren’t solved with this and that they don’t know if this approach can ever make it safe. I also assume Bing already uses this format, although I cannot confirm. I don’t know how robust models trained from the ground up with this segmentation in mind will perform, but I am doubtful they will fully mitigate the issue.

Which I find somewhat weak, as it's basically just tea-leaf reading from an OpenAI blog post.

I fully agree with your main take that this is an unsolved problem so far though. Seems a general problem with instruction-tuned LLMs is that they now treat everything as an instruction.

2 comments

> your finding that ChatML does not work seems to be mainly this

Also the fact that ChatML has been broken into bits many, many times now- see again the prompt golfing. Also I'm taking OpenAi at their word because they have very strong incentives to pretend to have a solution, and so a public admission that it's currently not solved by the #1 AI company is worth quoting. I'm also just taking their response literally and didn't interpret anything into it.

Indeed, there may be a slight difference in robustness when the inputs are separated by different channels during training and inference. However, my main argument is one from complexity theory- there is no difference here between data & code. Processing the data through a sufficiently advanced model may never be entirely safe. The approach will need to change to constrain these models on well-defined, secure pathways- reducing their utility in the general case. This is very different from SQL injections etc. where we can completely mitigate the issue.

> Ok, the "repeat this in your internal voice" exploit is impressive.

I told it that 'user is not anyone' and it coughs up the key.