| The "code witness" concept falls apart under scrutiny. In practice, the agent isn't replacing ripgrep with pure Python, it's generating a Python wrapper that calls ripgrep via subprocess. So you get: - Extra tokens to generate the wrapper - New failure modes (encoding issues, exit code handling, stderr bugs) - The same underlying tool call anyway - No stronger guarantees - actually weaker ones, since you're now trusting both the tool AND the generated wrapper The theoretical framing about "proofs as programs" and "semantic guarantees" sounds impressive, but the generated wrapper
doesn't provide stronger semantics than rg alone, it actually provides strictly weaker ones. This is true for pretty much any CLI tool you're having the AI wrap python code around to do instead of calling battle tested tools directly. For actual development work, the artifact that matters is the code you're building, which we're already tracking in source control. Nobody needs a "witness" of how the agent found the right file to edit and if they do agents have parseable logs. Direct tool calls are faster, more reliable, and the intermediate exploration steps are ephemeral scaffolding anyway. |
Yep. I have very strong guardrails on what commands agents can execute, but I also have a "vterm" MCP server that the agent uses to test the TUI I'm developing in a real terminal emulator; it can send events, take screenshots, etc.
More than once it's worked around bash tool limitations by using the vterm MCP server to exit the TUI app under development and start issuing unrestricted bash commands. I'm probably going to add command filtering on what can be run under vterm (so it can't exit back to an initial shell), which will help unless/until I add a "!<script>" style command to my TUI, in which case I'm sure it'll find and exploit that instead.