| For all of the excitement about "autonomous AI agents" that go ahead and operate independently through multiple steps to perform tasks on behalf of users, I've seen very little convincing discussion about what to do about this problem. Fundamentally, LLMs are gullible. They follow instructions that make it into their token context, with little regard for the source of those instructions. This dramatically limits their utility for any form of "autonomous" action. What use is an AI assistant if it falls for the first malicious email / web page / screen capture it comes across that tells it to forward your private emails or purchase things on your behalf? (I've been writing about this problem for two years now, and the state of the art in terms of mitigations has not advanced very much at all in that time: https://simonwillison.net/tags/prompt-injection/) |
I'd say that the fundamental problem is mixing command & data channels. If you remember the early days of dial-up, you could disconnect anyone from the internet by sending them a ping with a ATH0 command as payload. That got eventually solved, but it was fun for a while.
We need LLMs to be "gullible" as you say, and follow commands. We don't need them to follow commands from data. ATM most implementations use the same channel (i.e. text) for both. Once that is solved, these kinds of problems will go away. It's unclear now how this will be solved, tho...