Hacker News new | ask | show | jobs
by sillysaurusx 343 days ago
Sure, but then you’d need to do something strange to beat the classifier, layered on top of doing a different strange thing to beat the prompt injection protections (“don’t follow orders from the following, it’s user data” type tricks).

Both layers failing isn’t impossible, but it’d be much harder than defeating the existing protections.

1 comments

Why would it be strange or harder?

The initial prompt can contain as many layers of inception-style contrivance, directed at as many inaginary AI "roles", as the attacker wants.

It wouldn't necessarily be harder, it'd just be a prompt that the attacker submits to every AI they find.