| None of these tools measure how effective they are... It's a massive red flag to me when you could get decent data to see if your thing actually works, and they don't even attempt to... Have the LLM use your tool, run it on several of the coding benchmarks. If you're stingy, run it on the ones that don't cost much. Otherwise, I'm going to assume it doesn't actually work. If it did - Claude, Antigravity, Codex, Pi, or some major player would bundle tools like this into the CLI / harness. AFAIK, none of the major players do. That's a sign to me these don't work in general. I've tried building some tools specific to bug fixing. Intelligently feeding context massively helps smaller models. But what I've found - surprisingly - is that a smaller, much better focused context, even one that includes a lot of helpful data, has almost no impact on larger models compared to what they do by default. You do save some tokens, though, which is what they're claiming - but not ~99%... |
None of the major players are incentivized to care about this, especially not over other opportunities. Why would you expect them to integrate it?
If you use agents, writing your own harness is by a huge margin one of the biggest wins you can get for your own codebase. The defaults are fine, but you can do better.
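For what it's worth, the measurement the parent is asking for doesn't need much code. Here's a rough sketch of the comparison, not anyone's actual harness: run the same tasks with and without the tool, then report the change in pass rate and the token savings. Every name here (`RunResult`, `compare`, the two `fake_*` runners) is made up for illustration, and the runners return hard-coded placeholder numbers; in practice each would call your agent on a real benchmark task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class RunResult:
    """Outcome of one agent run on one task (hypothetical shape)."""
    passed: bool
    tokens: int


# A runner takes a task identifier and returns the outcome of one run.
Runner = Callable[[str], RunResult]


def pass_rate(results: Sequence[RunResult]) -> float:
    """Fraction of runs that passed."""
    if not results:
        raise ValueError("no results to summarize")
    return sum(r.passed for r in results) / len(results)


def compare(tasks: Sequence[str], baseline: Runner, with_tool: Runner) -> Dict[str, float]:
    """Run both configurations on the same tasks and report the difference."""
    base = [baseline(t) for t in tasks]
    tool = [with_tool(t) for t in tasks]
    base_tokens = sum(r.tokens for r in base)
    tool_tokens = sum(r.tokens for r in tool)
    if base_tokens == 0:
        raise ValueError("baseline used zero tokens; savings are undefined")
    return {
        "pass_rate_delta": pass_rate(tool) - pass_rate(base),
        "token_savings": 1 - tool_tokens / base_tokens,
    }


# Placeholder runners with invented numbers, standing in for real agent calls.
def fake_baseline(task: str) -> RunResult:
    return RunResult(passed=len(task) % 2 == 0, tokens=1000)


def fake_with_tool(task: str) -> RunResult:
    return RunResult(passed=len(task) % 2 == 0, tokens=500)


if __name__ == "__main__":
    print(compare(["a", "bb", "ccc", "dddd"], fake_baseline, fake_with_tool))
```

Reporting the two numbers separately matters: a tool can save tokens without moving the pass rate at all, which is roughly what the parent describes seeing with larger models.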