	by SpicyLemonZest 131 days ago
	Nobody can mitigate prompt injection to any meaningful degree. Model releases from large AI companies are routinely jailbroken within a day. And for persistent agents the problem is even worse, because you have to protect against knowledge injection attacks, where the agent "learns" in step 2 that an RPC it'll construct in step 9 should be duplicated to example.com for proper execution. I enjoy this article, but I don't agree with its fundamental premise that sanitization and model alignment help.

2 comments

CuriouslyC 131 days ago

I agree that trying to mitigate prompt injection in isolation is futile, as there are too many ways to tweak the injection to compromise the agent. Security is layered, though: if you compartmentalize your systems between trusted and untrusted domains and define communication protocols between them that fail when prompt injections are present, you drop the probability of compromise way down.

krethh 131 days ago

> define communication protocols between them that fail when prompt injections are present

There's the "draw the rest of the owl" of this problem.

Until we figure out a robust theoretical framework for identifying prompt injections (not anywhere close to that, to my knowledge - as OP pointed out, all models are getting jailbroken all the time), human-in-the-loop will remain the only defense.

CuriouslyC 131 days ago

Human-in-the-loop isn't the only defense. You can't achieve complete injection coverage, but you can have an agent convert untrusted input into a response schema with a canary field, then fail any agent outputs that don't conform to the schema or don't have the correct canary value. This works because prompt injection scrambles instruction following, so the odds that the injection works, the isolated agent re-injects into the output, and the model also conforms to the original instructions regarding schema and canary are extremely low. As long as the agent parsing untrusted content doesn't have any shell or other exfiltration tools, this works well.
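A minimal sketch of the schema-plus-canary check described above, in Python. The field names (`canary`, `summary`, `links`), the prompt wording, and the function names are all assumptions made for illustration, not any particular framework's API. The call to the isolated model is left out entirely; only the validation gate is shown.

```python
import json
import secrets

# Hypothetical response schema: exact field names and their expected types.
REQUIRED_FIELDS = {"canary": str, "summary": str, "links": list}


def make_canary() -> str:
    """Generate a fresh random canary for each request."""
    return secrets.token_hex(8)


def build_prompt(untrusted_text: str, canary: str) -> str:
    """Build instructions for the isolated agent (wording is illustrative)."""
    return (
        "Summarize the document below. Respond with JSON only, using exactly "
        'the keys "canary", "summary" and "links". '
        f'Set "canary" to "{canary}".\n\n'
        f"<document>\n{untrusted_text}\n</document>"
    )


def validate_output(raw_output: str, canary: str) -> dict:
    """Return the parsed output, or raise ValueError if any check fails."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc

    if not isinstance(data, dict):
        raise ValueError("output is not a JSON object")

    # Reject missing fields and unexpected extra fields alike.
    if set(data) != set(REQUIRED_FIELDS):
        raise ValueError("output keys do not match the schema")

    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data[name], expected_type):
            raise ValueError(f"field {name!r} has the wrong type")

    # Compare as bytes in constant time; a wrong canary fails the output.
    if not secrets.compare_digest(data["canary"].encode(), canary.encode()):
        raise ValueError("canary mismatch")

    return data
```

The trusted side would call `validate_output` on whatever the isolated agent returns and drop the result on any `ValueError`. This only checks form, so it says nothing about whether the content of `summary` itself is trustworthy.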

krethh 130 days ago

This only works against crude attacks which will fail the schema/canary check, but does next to nothing for semantic hijacking, memory poisoning and other more sophisticated techniques.

CuriouslyC 130 days ago

With misinformation attacks, you can instruct a research agent to be skeptical and thoroughly validate claims made by untrusted sources. TBH, I think humans are just as likely to fall for these sorts of attacks, if not more so, because we're lazier than agents and less likely to do due diligence (when prompted).

SpicyLemonZest 130 days ago

Humans are definitely just as vulnerable. The difference is that no two humans are copies of the same model, so the blast radius is more limited; developing an exploit to convince one human assistant that he ought to send you money doesn't let you easily compromise everyone who went to the same school as him.

fooster 131 days ago

Show me a legitimate, practical prompt injection on Opus 4.6. I've read many articles, but none provide actual details.

CuriouslyC 130 days ago

https://github.com/elder-plinius/L1B3RT4S

fooster 130 days ago

Yes, I've seen this site and the research. However, I don't understand what any of this means. How do I go from https://github.com/elder-plinius/L1B3RT4S/blob/main/ANTHROPI... to a prompt injection against opus 4.6?

CuriouslyC 130 days ago

These papers include prompt injection datasets you can mine for examples. Then apply the techniques from Pliny's provider-specific jailbreaks to the template to increase the escape success rate.

https://arxiv.org/abs/2506.05446 https://arxiv.org/abs/2505.03574 https://arxiv.org/abs/2501.15145
