| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by NeutralCrane 376 days ago
	No this is pretty much solved at this point. You simply have a secondary model/agent act as an arbitrator for every user input. The user input gets preprocessed into a standardized, formatted text representation (not a raw user message), and the arbitrator flags attempts at jailbreaking, prior to the primary agent/workflow being able to act on the user input.

1 comments

That doesn't work either. It's always possible to come up with an attack which subverts the "moderator" model first.

Using non-deterministic AI to protect against attacks against non-deterministic AI is a bad approach.

So you just need another agent to review the data being passed to the protector agent. Easy-peasy.

Use my openAI referral code #LETITRAIN for 10% off!