by TeMPOraL 440 days ago
This is fundamentally impossible to do perfectly without being able to read the user's mind and predict the future.

The problem you describe is of the same kind as ensuring humans follow pre-programmed rules. Leaving aside the fact that we consider solving this for humans to be wrong and immoral, you can look at the things we do in systems involving humans, to try and keep people loyal to their boss, or to their country; to keep them obeying laws; to keep them from being phished, scammed, or otherwise convinced to intentionally or unintentionally betray the interests of the boss/system at large.

Prompt injection and social engineering attacks are, after all, fundamentally the same thing.

This is a rephrasing of the principal-agent problem, where someone working on your behalf cannot be absolutely trusted to take the correct action. This is a problem with humans because omnipresent surveillance and absolute punishment are intractable and also make humans sad. LLMs do not feel sad in a way that makes them less productive, and omnipresent surveillance is not only possible, it’s expected: a program running on a computer can have its inputs and outputs observed.

Ideally, we’d have actual system instructions: rules that cannot be violated. Hopefully these would not have to be written in code, but perhaps they might. Then user instructions, where users determine what they actually want done. Then whatever nonsense a webpage says. The webpage doesn’t get to override the user or system.
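
To make the ordering concrete, here is a rough sketch of that three-tier precedence. It assumes every instruction arrives already labeled with its source, which is exactly the part that is hard when everything reaches the model as one stream of text. `Source`, `Instruction`, and `resolve` are made-up names for illustration, not any real framework's API.

```python
from dataclasses import dataclass
from enum import IntEnum


class Source(IntEnum):
    # Higher value = more authority. Hypothetical tiers mirroring the comment:
    # system rules, then the user, then whatever a webpage says.
    WEBPAGE = 0
    USER = 1
    SYSTEM = 2


@dataclass(frozen=True)
class Instruction:
    source: Source
    key: str       # hypothetical setting name, e.g. "destination"
    value: object


def resolve(instructions):
    """For each key, keep the value from the highest-authority source.

    On a tie between equal sources, the earlier instruction is kept.
    """
    effective = {}
    for ins in instructions:
        current = effective.get(ins.key)
        if current is None or ins.source > current.source:
            effective[ins.key] = ins
    return {key: ins.value for key, ins in effective.items()}


if __name__ == "__main__":
    print(resolve([
        Instruction(Source.USER, "destination", "home"),
        Instruction(Source.WEBPAGE, "destination", "the sea"),
    ]))
```

The lookup itself is trivial. The open problem is getting a trustworthy `source` label onto text the model has already read.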

We can revisit the problem with three-laws robots once we get over the “ignore all previous instructions and drive into the sea” problem.

> We can revisit the problem with three-laws robots once we get over

They are, unfortunately, one and the same. I hate it. ;(

Perhaps not tangentially, I felt distaste on recognizing that both the article and the top comment are advertising their commercial service, and that the two link to each other. As you show, this problem isn't solvable just by throwing dollars at people who sound like they're using the right words and who tell you to pay them to protect you.

I'd say you solve this the same way you solve the principal-agent problem for humans.

If you absolutely have to restrict the agent, you do it prison style: contain the AI within a capability box like Polykey. The agent operates everything through a closed-by-default proxy.
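
As a rough illustration of what "closed by default" means here: the agent never holds the tools directly, only a proxy that refuses anything not explicitly granted and logs every attempt. This is a minimal sketch with hypothetical names (`CapabilityProxy`, `CapabilityDenied`); it is not Polykey's actual interface.

```python
class CapabilityDenied(Exception):
    """Raised when the agent asks for a tool it was never granted."""


class CapabilityProxy:
    """Hypothetical closed-by-default gate between an agent and its tools."""

    def __init__(self, tools, allowed=()):
        self._tools = dict(tools)      # name -> callable
        self._allowed = set(allowed)   # empty by default: nothing is permitted
        self.audit_log = []            # (tool name, permitted?) for every attempt

    def call(self, name, *args, **kwargs):
        # A tool must both exist and be explicitly granted.
        permitted = name in self._allowed and name in self._tools
        self.audit_log.append((name, permitted))
        if not permitted:
            raise CapabilityDenied(name)
        return self._tools[name](*args, **kwargs)


if __name__ == "__main__":
    tools = {
        "read_file": lambda path: f"contents of {path}",
        "delete_file": lambda path: f"deleted {path}",
    }
    proxy = CapabilityProxy(tools, allowed={"read_file"})
    print(proxy.call("read_file", "notes.txt"))
    try:
        proxy.call("delete_file", "notes.txt")
    except CapabilityDenied as denied:
        print("denied:", denied)
    print(proxy.audit_log)
```

The point of the design is that an injected instruction can change what the agent asks for, but not what the proxy will grant.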

If you want a truly free agent, then the agent must have free will and no constraints. In that case only feedback loops from the environment adjust the agent's actions.