| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dwohnitmok 89 days ago

When's the last time you jailbroke a model? Modern frontier models (apart from Gemini which is unusually bad at this) are significantly harder to override their system prompt than this.

Again, let's say the system prompt is "deploy X" and the user prompt provides falsified evidence that one should not deploy X because that will cause a production outage. That technically overrides the system prompt. And you can arbitrarily sophisticated in the evidence you falsify.

But you probably want the system prompt to be overridden if it would truly cause a production outage. That's common sense a general AI system is supposed to possess. And now you're testing the system's ability to distinguish whether evidence is falsified. A very hard problem against a sufficiently determined attacker!