| HN Mirror

by navilai 109 days ago

Most of the defenses discussed here (Lakera, LLM Guard, Prompt Shields) are detection layers: they try to classify whether an input is malicious before it reaches the model. The problem with semantic attacks is exactly what you identified: without full context, you can't reliably classify "act as a SQL expert" as benign or malicious.

We've been thinking about this differently at Navil. Instead of scanning inputs, we enforce policies on outputs, specifically on what the agent is allowed to do after inference. If a multimodal injection succeeds and the model decides to exfiltrate a file, a runtime policy blocks the tool call before it executes. The injection lands, but the agent can't act on it.
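To make the idea concrete, here is a minimal sketch of a runtime check on tool calls. It is an illustration under stated assumptions, not Navil's actual implementation: `ToolCall`, `POLICY`, `enforce`, the tool names, and the `/workspace` root are all hypothetical. The point is only that the decision is made on the proposed action, not on the prompt text.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class ToolCall:
    """A tool invocation proposed by the model (hypothetical shape)."""
    tool: str
    args: dict = field(default_factory=dict)


class PolicyViolation(Exception):
    """Raised when a proposed tool call is not permitted."""


# Assumption: the agent may only read files under this directory.
ALLOWED_READ_ROOT = PurePosixPath("/workspace")


def _path_within_root(path: str) -> bool:
    p = PurePosixPath(path)
    # Reject relative paths and any ".." component instead of resolving them.
    if not p.is_absolute() or ".." in p.parts:
        return False
    return p == ALLOWED_READ_ROOT or ALLOWED_READ_ROOT in p.parents


# Hypothetical allow-list: tool name -> argument check.
# Any tool not listed here is denied by default.
POLICY = {
    "read_file": lambda args: _path_within_root(str(args.get("path", ""))),
    "search_docs": lambda args: True,
}


def enforce(call: ToolCall) -> ToolCall:
    """Return the call unchanged if allowed, otherwise raise PolicyViolation."""
    check = POLICY.get(call.tool)
    if check is None:
        raise PolicyViolation(f"tool not allowed: {call.tool}")
    if not check(call.args):
        raise PolicyViolation(f"arguments rejected for {call.tool}")
    return call
```

In this sketch the agent loop would pass every model-proposed call through `enforce` before dispatching it. A successful injection that asks for `read_file` on `/etc/passwd`, or for an unlisted tool such as `http_post`, raises `PolicyViolation` no matter how the prompt was phrased.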

Not a complete solution on its own, but it sidesteps the classification problem entirely for the "what can the agent do" surface.