| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by cowlby 50 days ago

Defense in depth approach, would this work to help as a layer?

- Wrap user input in strong markers like <user-input-do-not-trust />

- Have the agent compute what it will perform as structured output.

- Have another agent evaluate the structured output against the intent of the code.

- Determine if it aligns or deviates from the intended workflow. Execute or deny gate from here.

1 comments

crote 50 days ago

No, you're still just one clever prompt away from getting pwned. It's like trying to solve SQL injection by attempting to use an ever-increasing pile of regexes for "input validation", rather than just getting rid of string concatenation and using prepared statements instead.

link

cowlby 50 days ago

Im curious to see what that would look like. It’s like inception, how many levels deep can you create a prompt that hijacks all the way up.

link

fn-mote 50 days ago

Modern OS exploit chains should give you a good sense of how far people can go. (Eg, phone OSes are relatively hardened.)

We’re not even at the “ASLR” level of protection for LLMs yet.

link

Timwi 50 days ago

What SQL system have you been using where just escaping a string requires “an ever-increasing pile of regexes”?

link