by terminalbraid 9 days ago
But the agent could be trained on sensitive data, and that data could leak, which could enable a different attack.

Saying it's safe to "ignore" anything that exposes information is dangerous. You might as well claim social engineering isn't real as long as the person doesn't have direct access to the thing you want.

3 comments

They are suggesting that you should assume the user has full access to the same tools as the agent, which is a helpful way to approach it. You mentioned the prompt side of things, and I think you should use a similar mindset there: just assume the user can read the entire prompt exactly as it’s sent.
You should also assume the user can read any data you send back from a tool call or any data you add to a user response. If any part of the input or output is controllable by an attacker, you should assume some prompt injection is possible that gives them access to all the data and tool calls the agent has, or has had, access to.
Yes, that's part of the "entire prompt"
Agreed. The agent and the tools are different types of vulnerabilities. Both are important, especially if you have dedicated fine-tuning (which won't be user-dependent, of course).

But also stuff like RAG: support agents usually have access to all internal support knowledge base material. Including stuff you don't want to leak verbatim. And there are other things to consider too, like your agent being used to run other people's prompts. That's not a data loss issue, but it could be a financial one.

But yes, I do agree that for the tools' security, the agent shouldn't be considered part of the security model. Any protections there are nice to have but shouldn't be relied upon.
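
To make that concrete, here is a rough sketch of a tool that does its own authorization and ignores whatever the agent "decided". Everything in it is made up for illustration (`ORDERS`, `Session`, `Forbidden`, `get_order`); the only assumption is that your app authenticates the user outside the model and passes that identity to the tool handler.

```python
from dataclasses import dataclass

# Hypothetical in-memory data; a real system would query a database.
ORDERS = {
    "o1": {"owner": "alice", "total": 40},
    "o2": {"owner": "bob", "total": 75},
}


@dataclass(frozen=True)
class Session:
    """Identity established by the app's own auth layer, never by the model."""
    user_id: str


class Forbidden(Exception):
    """Raised when the authenticated user may not access the resource."""


def get_order(session: Session, order_id: str) -> dict:
    """Tool handler. order_id comes from the model, so treat it as attacker-controlled."""
    order = ORDERS.get(order_id)
    # Same check a direct API call from this user would get.
    if order is None or order["owner"] != session.user_id:
        raise Forbidden(f"{session.user_id} cannot read order {order_id}")
    return order
```

If a prompt injection talks the agent into calling `get_order` with someone else's order ID, the call fails the same way it would if the user had hit the API directly.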

> Including stuff you don't want to leak verbatim

This is exactly what I mean: if you give your agent access to some knowledge base through RAG, you should assume that this knowledge is now public information. If you don't want it to leak, design your agent so that it doesn't have access to it.
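
A minimal sketch of what "doesn't have access to it" could look like: filter at indexing time, so the sensitive documents never reach anything the agent can search. The `Doc` type, the `public` flag, and the toy keyword retrieval are all assumptions for illustration, standing in for whatever vector store you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    public: bool  # hypothetical flag: True only if safe to show any user verbatim


KB = [
    Doc("faq-1", "How to request a refund for an order", True),
    Doc("int-1", "Internal refund override codes for escalations", False),
]


def build_agent_index(docs: list[Doc]) -> list[Doc]:
    """Only public docs ever reach the index the agent can search."""
    return [d for d in docs if d.public]


def retrieve(index: list[Doc], query: str, k: int = 3) -> list[Doc]:
    """Toy keyword-overlap retrieval standing in for a real vector search."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.text.lower().split())), d) for d in index]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for score, d in ranked if score > 0][:k]
```

The point of filtering in `build_agent_index` rather than in the prompt ("don't reveal internal docs") is that no injection can retrieve a document that was never indexed.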

That's yet another class of attack, and a pretty rare one. Very few agents run on fine-tuned models, but even for those that do, the same framing applies. You should assume that anything that goes into the training data is public information.