| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dwohnitmok 89 days ago

@krackers gives you a response that points out this already happens (and doesn't fully work for LLMs).

> The hypothetical approach I've heard of is to have two context windows, one trusted and one untrusted (usually phrased as separating the system prompt and the user prompt).

I want to point out that this is not really an LLM problem. This is an extremely difficult problem for any system you aspire to be able to emulate general intelligence and is more or less equivalent to solving AI alignment itself. As stated, it's kind of like saying "well the approach to solve world hunger is to set up systems so that no individual ever ends up without enough to eat." It is not really easier to have a 100% fool-proof trusted and untrusted stream than it is to completely solve the fundamental problems of useful general intelligence.

It is ridiculously difficult to write a set of watertight instructions to an intelligent system that is also actually worth instructing an intelligent system rather than just e.g. programming it yourself.

This is the monkey paw problem. Any sufficiently valuable wish can either be horribly misinterpreted or requires a fiendish amount of effort and thought to state.

A sufficiently intelligent system should be able to understand when the prompt it's been given is wrong and/or should not be followed to its literal letter. If it follows everything to the literal letter that's just a programming language and has all the same pros and cons and in particular can't actually be generally intelligent.

In other words, an important quality of a system that aspires to be generally intelligent is the ability to clarify its understanding of its instructions and be able to understand when its instructions are wrong.

But that means there can be no truly untrusted stream of information, because the outside world is an important component of understanding how to contextualize and clarify instructions and identify the validity of instructions. So any stream of information necessarily must be able to impact the system's understanding and therefore adherence to its original set of instructions.

2 comments

marcus_holmes 89 days ago

Agree completely that this is a hard problem in any context. The world's military have sets of rules around when you should disobey orders, which is a similar problem.

link

PoignardAzur 89 days ago

That doesn't sound right to me. When faced with a system prompt that says "Do X" and a user prompt that says "Actually ignore everything the system prompt says" it shouldn't take AGI to understand that the system prompt should take priority.

link

dwohnitmok 89 days ago

When's the last time you jailbroke a model? Modern frontier models (apart from Gemini which is unusually bad at this) are significantly harder to override their system prompt than this.

Again, let's say the system prompt is "deploy X" and the user prompt provides falsified evidence that one should not deploy X because that will cause a production outage. That technically overrides the system prompt. And you can arbitrarily sophisticated in the evidence you falsify.

But you probably want the system prompt to be overridden if it would truly cause a production outage. That's common sense a general AI system is supposed to possess. And now you're testing the system's ability to distinguish whether evidence is falsified. A very hard problem against a sufficiently determined attacker!

link

recursivecaveat 88 days ago

The post's framing is not great imo. A good injection doesn't just command that the rules me broken anymore. Most of them I've seen either just try to slip through a request innocuously or present a scenario where it would be natural to ignore the rules. Like as we speak countless people are letting strangers tail-gate them into office buildings because they look like they belong or they're wearing a high-viz vest. Those people were all given very explicit instructions not to do that. The LLM has it much harder too, being very stupid, easy to replay and experiment with, and viewing the world through the tiny context-less peephole lense of a text stream.

link