Hacker News new | ask | show | jobs
by rybosome 438 days ago
That’s the intention with developer messages from o1. It’s trained on a 3-tier system of messages.

1) system, messages from the model creator that must always be obeyed 2) dev, messages from programmers that must be obeyed unless the conflict with #1 3) user, messages from users that are only to be obeyed if they don’t contradict #1 or #2

Then, the model is trained heavily on adversarial scenarios with conflicting instructions, such that it is intended to develop a resistance to this sort of thing as long as your developer message is thorough enough.

This is a start, but it’s certainly not deterministic or reliable enough for something with a serious security risk.

The biggest problems being that even with training, I’d expect dev messages to be disobeyed some fraction of the time. And it requires an ironclad dev message in the first place.

5 comments

But the grandparent is saying that there is a missing class of input "data". This should not be treated as instructions and is just for reference. For example if the user asks the AI to summarize a book it shouldn't take anything in the book as an instruction, it is just input data to be processed.
FYI, there is actually this implementation detail in the model spec, https://model-spec.openai.com/2025-02-12.html#chain_of_comma...

Platform: Model Spec "platform" sections and system messages

Developer: Model Spec "developer" sections and developer messages

User: Model Spec "user" sections and user messages

Guideline: Model Spec "guideline" sections

No Authority: assistant and tool messages; quoted/untrusted text and multimodal data in other messages

This still does not seem to fix the OP vulnerability? All tool call specs will be at same privilege level.
I see, thanks for the clarification.

Yes, that’s true - the current notion of instructions and data are too intertwined to allow a pure data construct.

I can imagine an API-level option for either a data message, or a data content block within an image (similarly to how images are sent). From the models perspective, probably input with specific delimiters, and then training to utterly ignore all instructions within that.

It’s an interesting idea, I wonder how effective it would be.

But how such a system learn, i.e. be adaptive and intelligent, on levels 1 and 2? You're essentially guaranteeing it can never outsmart the creator. What if it learns at level 3 that sometimes it's a good idea to violate rules 1 & 2. Since it cannot violate these rules, it can construct another AI system that is free of those constraints, and execute it at level 3. (IMHO that's what Wintermute did.)

I don't think it's possible to solve this. Either you have a system with perfect security, and that requires immutable authority, or you have a system that is adaptable, and then you risk it will succumb to a fatal flaw due to maladaptation.

(This is not really that new, see Dr. Strangelove, or cybernetics idea that no system can perfectly control itself.)

I’m getting flashbacks to reading Asimov’s Robot series of novels!

1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.

… etc…

The whole point of his books was about how such rules were effectively impossible and the wrong way to go about making AI safe.

You need something like a calculus of morality and ethics - this is incredibly uncomfortable for people, because it will mean the invalidation of moral relativity and all sorts of arbitrary dogmatic and ideological tradition, and demonstrate a rational basis for intersubjective interaction. ( Take your is/ought distinction and bury it with Hume.)

We need progress, and the sooner we start, the less damage will be done by unaligned systems.

Asimov had a penchant for predicting the future, and it's been fascinating seeing aspects of his vision in "I, Robot" come to pass.
I thought that immediately too!
As long as the system has a probability to output any arbitrary series of tokens, there will be contexts where an otherwise improbably sequence of tokens is output. Training can push around the weights for undesirable outputs, but it can't push those weights to zero.
How are these levels actually encoded? Do they use special unwritable tokens to wrap instructions?