The second layer you write about here, doesn’t it kind of depend on human judgement, and hence produce the same problem as having a human in the loop for all tools? Because of decision fatigue, it isn’t really a layer.
Can we not assume that the plan you just said “ok” to came from a user prompt you made earlier in the chat session, and hence does influence this decision process?
Another point in the idea is that this trusted context can even include the AI replies, as long as there hasn’t yet been a tool call that brings back a response an attacker can control.
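To make that concrete, here is a rough sketch of how I picture the trusted context being built. It is only an illustration under my own assumptions: the `Message` type, the role names, and the `trusted_context` function are all hypothetical, and it conservatively treats every tool response as something an attacker might control.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str      # hypothetical role names: "user", "assistant", "tool"
    content: str


def trusted_context(history: List[Message]) -> List[Message]:
    """Return the messages an attacker could not have influenced.

    Assumption: any tool response may carry attacker-controlled text,
    so the session counts as tainted from the first one onward.
    """
    trusted = []
    tainted = False
    for msg in history:
        if msg.role == "tool":
            # Tool output is never trusted, and it taints later AI replies.
            tainted = True
        elif msg.role == "user":
            # User prompts are always trusted, including a bare "ok".
            trusted.append(msg)
        elif msg.role == "assistant" and not tainted:
            # AI replies are trusted only before any tool response arrived.
            trusted.append(msg)
    return trusted


history = [
    Message("user", "Plan a refactor of the billing module"),
    Message("assistant", "Here is my plan: ..."),
    Message("user", "ok"),
    Message("tool", "<file contents, possibly attacker-controlled>"),
    Message("assistant", "Based on the file, I will now ..."),
]
print([m.role for m in trusted_context(history)])  # ['user', 'assistant', 'user']
```

In this sketch the plan the bot proposed before any tool call stays in the trusted context, so a later “ok” from the user can be read against it, while the AI reply written after the tool response is left out.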
But it’s entirely possible that there are edge cases here. A red teaming dataset to cover these cases shouldn’t be hard to create.
But why is it that you think you can predict whether a given tool name is required, based only on the user prompts and a few tool responses?
I can also reply to the bot with simply “ok” for a plan the bot has already laid out.