| I made this for myself, and it seemed like it might be useful to others. I'd love some feedback, both on the threat model and the tool itself. I hope you find it useful! Backstory: I've been using many agents in parallel as I work on a somewhat ambitious financial analysis tool. I was juggling agents working on epics for the linear solver, the persistence layer, the front-end, and planning for the second-generation solver. I was losing my mind playing whack-a-mole with the permission prompts. YOLO mode felt so tempting. And yet. Then it occurred to me: what if YOLO mode isn't so bad? Decision fatigue is a thing. If I could cap the blast radius of a confused agent, maybe I could just review once. Wouldn't that be safer? So that day, while my kids were taking a nap, I decided to see if I could put YOLO-mode Claude inside a sandbox that blocks exfiltration and regulates git access. The result is yolo-cage. Also: the AI wrote its own containment system from inside the system's own prototype. Which is either very aligned or very meta, depending on how you look at it. |
And adding the fact that you used AI to write the supposed containment system, I'm really not seeing the safety benefits here.
The docs also seem very AI-generated (see below). What part did you yourself play in actually putting this together? How can you be sure that filtering a few specific (listed) commands will actually give any sort of safety guarantees?
https://github.com/borenstein/yolo-cage/blob/main/docs/archi...