| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by simonw 377 days ago

The problem is that doesn't work. LLMs cannot distinguish between instructions and data - everything ends up in the same stream of tokens.

System prompts are meant to help here - you put your instructions in the system prompt and your data in the regular prompt - but that's not airtight: I've seen plenty of evidence that regular prompts can over-rule system prompts if they try hard enough.

This is why prompt injection is called that - it's named after SQL injection, because the flaw is the same: concatenating together trusted and untrusted strings.

Unlike SQL injection we don't have an equivalent of correctly escaping or parameterizing strings though, which is why the problem persists.

2 comments

tedunangst 377 days ago

People will never give up the dream that we can secure the LLM by saying please one more time than the attacker.

link

NeutralCrane 376 days ago

No this is pretty much solved at this point. You simply have a secondary model/agent act as an arbitrator for every user input. The user input gets preprocessed into a standardized, formatted text representation (not a raw user message), and the arbitrator flags attempts at jailbreaking, prior to the primary agent/workflow being able to act on the user input.

link

simonw 376 days ago

That doesn't work either. It's always possible to come up with an attack which subverts the "moderator" model first.

Using non-deterministic AI to protect against attacks against non-deterministic AI is a bad approach.

link

K0balt 376 days ago

So you just need another agent to review the data being passed to the protector agent. Easy-peasy.

Use my openAI referral code #LETITRAIN for 10% off!

link