| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by friendzis 131 days ago

> Essentially the models are given a set of conflicting constraints with some relative importance (ethics>KPIs), a pressure to follow the latter and not the former, and then models are observed at how good they follow the instructions to prioritize based on importance.

> At the same time it is important to keep in mind that it anthropomorphizes the models that technically don't interpret the ethical constraints the same was as this is assumed by most readers.

It does not really matter, though. What matters is the conflict resolution.

The "constraints of some relative importance" or "constraints and instructions" might as well be the system and user prompts. Or any of the "prompt engineering" ways to harden prompts against prompt injection.

Such research tells people right in the face that not only prompt injection is some viable theoretical scenario, but puts some number on the exploitability. With the current numbers I am keeping prompts nine locks away from any untrusted input.