Hacker News new | ask | show | jobs
by simonw 979 days ago
I don't think this is about alignment (does that term have a robust definition?) - the problem with prompt injection is that the LLM exactly follows the instructions it has been given... but is unable to tell the difference between trusted and untrusted inputs.

I think this is fundamentally about gullibility. LLMs are gullible: they believe everything in their training data, and then they believe everything that is fed to them. But that means that if we feed them untrusted inputs they'll believe those too!

1 comments

I'm talking about “alignment” in the broad sense of aligning the actions of one intelligence to the goals of another.

Humans are in general not aligned, not to each other, and not to the survival of their species, not to all the other life on earth, and often not even to themselves individually. When a man is murdered, it is because his desire to live is misaligned with the perpetrator's desire to kill.

>and then they believe everything that is fed to them.

See but here's the thing...They don't.

GPT-3 will ignore tools when it disagrees with them - https://vgel.me/posts/tools-not-needed/

It's not a fundamental issue of gullibility. Reducing gullibility will reduce injection but it's not going to solve it.