This resonates. The problem I keep running into isn't that the model is bad; it's that the feedback loop is too thin. A y/n prompt in the terminal doesn't show you enough to catch the model doing something subtly wrong.
I've been building a review UI layer for coding agents (Claude Code, Codex) that lets you actually inspect and edit what the agent is about to do before it executes: https://github.com/agentlayer-io/AgentClick
Turns out most of the "dumb" mistakes OP is talking about are catchable — you just need to actually see them before they ship.