| This is neat, and matches an observation I saw with early Claude Code usage: Sonnet would often call tools quickly to gather more context, whereas Opus would spend more time reasoning and trying to solve a problem with the context it had. This led to lots of duplicated functions and slower development, though the new models (GPT-5.5 and Opus 4.6) seem to suffer from this less. My takeaway was that “dumber” (i.e. smaller) models might be better as an agentic harness, or at least feasibly cheaper/faster to run for a large swath of problems. I haven’t found Gemini to be particularly good at long-horizon tool calling though. It might be interesting to distill traces from real Codex or Claude Code sessions, where there are long chains of tool calls between each user query. Personally, I’d love a slightly larger model that runs easily on, e.g., a 32GB M2 MBP, but with tool calling RL as the primary focus. Some of the open-weight models are getting close (Kimi, Qwen), but the quantization required to fit them on smaller machines seems to drop performance substantially. |
I have a suite of tools I've built for myself on top of the OpenRouter API for very specific tasks. Press a button and the LLM does (one) useful thing, not press a button and let the LLM run tool calls in a loop for 5 minutes and hope it does things in the correct order.
If multiple tools need to be called to do a useful thing, I chain them together deterministically in my code. This is much more reliable, since I can check the output of A before proceeding to task B or C, and it's also more time- and token-efficient. Agentic loops are a huge scam.
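A rough sketch of the "chain it deterministically" pattern described above, not the actual tools. Everything here is hypothetical: the step names, the prompts, and the `llm` argument, which stands in for any prompt-in/text-out function (for example a thin wrapper around a chat completion request) and is not a real OpenRouter client.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that takes a prompt and returns text.
LLM = Callable[[str], str]

ALLOWED_LABELS = {"bug", "feature"}


class StepError(Exception):
    """Raised when a step's output fails its check, so later steps never run."""


def extract_title(llm: LLM, text: str) -> str:
    # Step A: one narrow task, followed by a plain-code sanity check.
    title = llm(f"Return only a short title for:\n{text}").strip()
    if not title or len(title) > 80:
        raise StepError(f"bad title: {title!r}")
    return title


def classify(llm: LLM, title: str) -> str:
    # Step A2: the answer must be one of a fixed set of labels.
    label = llm(f"Classify as 'bug' or 'feature': {title}").strip().lower()
    if label not in ALLOWED_LABELS:
        raise StepError(f"bad label: {label!r}")
    return label


def run_pipeline(llm: LLM, text: str) -> Dict[str, str]:
    # The order of calls is fixed in code; the model never picks the next tool.
    title = extract_title(llm, text)
    label = classify(llm, title)
    # Branch to task B or C based on the already-validated label.
    if label == "bug":
        summary = llm(f"Write a one-line bug report for: {title}").strip()
    else:
        summary = llm(f"Write a one-line feature request for: {title}").strip()
    return {"title": title, "label": label, "summary": summary}
```

Each step makes exactly one model call and is validated in ordinary code before the next one starts, so a bad output stops the chain at a known point instead of being fed into further calls.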