by ashwinr2002 143 days ago
Apologies for the formatting

Can we not assume that the plan you just said “ok” to came from a user prompt you made earlier in the chat session, and hence does influence this decision process?

Another point in the idea is that this trusted context can even include the AI's replies, as long as there hasn't yet been a tool call that brings back a response an attacker can control.
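
A rough sketch of what I mean, in Python. Everything here is made up for illustration: the `Message` type, the `attacker_controllable` flag, and the `trusted_prefix` helper are assumptions, not part of any real agent framework, and deciding which tool results an attacker can control is the hard part that this glosses over.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str  # "user", "assistant", or "tool"
    content: str
    # Hypothetical flag: True when a tool result may contain attacker-controlled text.
    attacker_controllable: bool = False


def trusted_prefix(history: List[Message]) -> List[Message]:
    """Return the user and assistant messages that precede the first
    tool result an attacker could control."""
    trusted = []
    for msg in history:
        if msg.role == "tool" and msg.attacker_controllable:
            break  # everything from this point on is treated as untrusted
        if msg.role in ("user", "assistant"):
            trusted.append(msg)
    return trusted


# Illustrative session: the plan and the "ok" both come before any tool output.
history = [
    Message("user", "Plan a cleanup of my downloads folder"),
    Message("assistant", "Proposed plan: list files, then delete duplicates."),
    Message("user", "ok"),
    Message("tool", "contents of a fetched web page", attacker_controllable=True),
    Message("assistant", "Based on the page, I will now do something else."),
]

trusted = trusted_prefix(history)
```

Here the plan and the “ok” end up in the trusted context, while the fetched page and the AI reply after it do not.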

But it’s entirely possible that there are edge cases here. A red-teaming dataset to cover these cases shouldn’t be hard to create.