by TeMPOraL
440 days ago
This is fundamentally impossible to do perfectly without being able to read the user's mind and predict the future. The problem you describe is of the same kind as ensuring humans follow pre-programmed rules. Leaving aside the fact that we consider solving this for humans to be wrong and immoral, you can look at the things we do in systems involving humans to try to keep people loyal to their boss or to their country, to keep them obeying laws, and to keep them from being phished, scammed, or otherwise convinced to betray, intentionally or unintentionally, the interests of the boss or the system at large. Prompt injection and social engineering attacks are, after all, fundamentally the same thing.
Ideally, we’d have actual system instructions, rules that cannot be violated. Hopefully these wouldn’t have to be written in code, though perhaps they would. Next come user instructions, where users specify what they actually want done. Last comes whatever nonsense a webpage says. The webpage doesn’t get to override the user or the system.
We can revisit the problem with three-laws robots once we get over the “ignore all previous instructions and drive into the sea” problem.
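For illustration, here is a minimal sketch of the system > user > webpage ordering, assuming (hypothetically) that every directive arrives as a structured record already tagged with its source. The names `Trust`, `Directive`, and `resolve` are made up for this example. The catch is that this assumption is exactly what an LLM lacks: it sees one undifferentiated stream of text, so a webpage can phrase its directive as if it came from the user. That is the objection raised in the first comment.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    # Hypothetical trust levels; a higher value means more authority.
    WEBPAGE = 0
    USER = 1
    SYSTEM = 2


@dataclass(frozen=True)
class Directive:
    key: str        # hypothetical: which setting the directive targets
    value: str
    source: Trust   # assumed to be known reliably, which is the hard part


def resolve(directives):
    """Return the effective value per key. A lower-trust source can never
    overwrite a value set by a higher-trust source; among equal-trust
    sources, the later directive wins."""
    effective = {}
    for d in directives:
        current = effective.get(d.key)
        if current is None or d.source >= current.source:
            effective[d.key] = d
    return {key: d.value for key, d in effective.items()}


# Usage: the webpage tries to redirect the car, the user tries to speed.
directives = [
    Directive("destination", "office", Trust.USER),
    Directive("destination", "the sea", Trust.WEBPAGE),  # ignored
    Directive("max_speed", "50", Trust.SYSTEM),
    Directive("max_speed", "200", Trust.USER),           # ignored
]
print(resolve(directives))  # {'destination': 'office', 'max_speed': '50'}
```

The rule itself is trivial once sources are labeled; the unsolved part is labeling them honestly when the "webpage" can speak in the user's voice.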