

by agentictrustkit 88 days ago

This is how I've come to think about it. It's less a "clever string that bypasses prompts" and more "untrusted parties are participating in your control plane." That's why purely linguistic defenses feel unsatisfying.

The architectural move that seems durable is separating capability from authority. You can expose many tools (that's capability), but the agent only gets authority to invoke a narrow subset under well-defined conditions (that's the policy), and the authority needs to be revocable and auditable independently of whatever happens in that context. That's basically how we already run normal organizations with people. Interns can see a lot but are limited in what they can do.
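
A minimal sketch of that split, under assumptions: the names here (`TOOLS`, `Grant`, `Authority`, the `/workspace/` path rule) are hypothetical and only illustrate the idea. The tool registry is the capability; the grants are the authority, and they live outside the model's context so they can be revoked and audited on their own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Grant:
    tool: str
    condition: Callable[[dict], bool]  # policy: when this tool may be invoked
    revoked: bool = False


@dataclass
class Authority:
    grants: Dict[str, Grant] = field(default_factory=dict)
    audit: List[Tuple[str, dict, bool]] = field(default_factory=list)

    def grant(self, tool: str, condition: Callable[[dict], bool]) -> None:
        self.grants[tool] = Grant(tool, condition)

    def revoke(self, tool: str) -> None:
        # Revocation touches only the grant; the tool itself stays registered.
        if tool in self.grants:
            self.grants[tool].revoked = True

    def allows(self, tool: str, args: dict) -> bool:
        g = self.grants.get(tool)
        ok = g is not None and not g.revoked and g.condition(args)
        self.audit.append((tool, args, ok))  # every check is recorded
        return ok


# Capability: everything the agent can see (hypothetical stub tools).
TOOLS = {
    "read_file": lambda args: f"contents of {args['path']}",
    "send_email": lambda args: f"sent to {args['to']}",
}

# Authority: a narrow subset, under a well-defined condition.
authority = Authority()
authority.grant("read_file", lambda a: str(a.get("path", "")).startswith("/workspace/"))
```

With this setup, `send_email` is visible to the agent but never authorized, and calling `authority.revoke("read_file")` withdraws the remaining grant without touching the tool list.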

The practical side: keep the model in a "propose" role, keep execution in a deterministic gate (schema validation + policy engine + sandbox), and log the decision as a first-class artifact. What I mean by that is who or what authorized it, what was considered, what side effect occurred, and so on. You still won't get perfect security, but you can make the failure mode "agent asked for something dumb and got blocked" instead of "agent executed a side effect because a webpage told it to."
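
Roughly what that gate could look like, as a sketch rather than a real implementation. Everything named here (`SCHEMAS`, `validate`, `policy`, `execute`, `Decision`, the `/sandbox/` rule) is a hypothetical stand-in; `execute` is a stub where a real sandbox would go.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Hypothetical tool schemas: field name -> required type.
SCHEMAS = {"read_file": {"path": str}, "delete_file": {"path": str}}


@dataclass
class Decision:
    proposer: str               # who or what asked
    tool: str
    args: Dict[str, Any]
    checks: List[str]           # what was considered
    allowed: bool
    reason: str
    side_effect: Optional[str]  # what actually happened, if anything


def validate(tool: str, args: Any) -> Optional[str]:
    schema = SCHEMAS.get(tool)
    if schema is None:
        return "unknown tool"
    if not isinstance(args, dict) or set(args) != set(schema):
        return "unexpected or missing fields"
    for key, typ in schema.items():
        if not isinstance(args[key], typ):
            return f"{key} must be {typ.__name__}"
    return None


def policy(tool: str, args: Dict[str, Any]) -> Optional[str]:
    if tool == "delete_file":
        return "delete_file is not authorized for this agent"
    if tool == "read_file" and not args["path"].startswith("/sandbox/"):
        return "path outside sandbox"
    return None


def execute(tool: str, args: Dict[str, Any]) -> str:
    # Stub standing in for sandboxed execution.
    return f"{tool} ran with {args}"


def gate(proposer: str, proposal: Dict[str, Any], log: List[Decision]) -> Decision:
    tool, args = proposal.get("tool", ""), proposal.get("args", {})
    checks: List[str] = []
    reason = None
    for name, check in (("schema", validate), ("policy", policy)):
        checks.append(name)
        reason = check(tool, args)
        if reason:
            break  # first failing check blocks the proposal
    allowed = reason is None
    effect = execute(tool, args) if allowed else None
    decision = Decision(proposer, tool, args, checks, allowed, reason or "ok", effect)
    log.append(decision)  # blocked proposals are logged too
    return decision
```

The model only ever produces the `proposal` dict. Whether anything runs is decided by plain code, and the `Decision` record exists either way.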