That assumes the agent knows which one is better. And baking in which one is better via post-training would require a study like this to establish where each one works well.
I’ve got a custom ultra-high-performance streaming semantic search that I exposed as a tool, and the RL bias in Claude is almost insurmountable without copious and consistent steering. Codex will follow instructions and use the tools I ask it to, but for God’s sake, between Claude asking to take a nap because it’s getting late in the session and it regressing to RL-biased tools like grep, it’s maddening. When I can get it to use my compositional tools, tool calls drop from like 20-50 to 3-4, but it’s almost impossible to steer.
Anthropic is, I believe, fully pursuing the idea that you shouldn't use their model with anything but their own products. They don't care whether it generalizes.
I agree, it's very frustrating to use with custom tools/harnesses that can speed up the process for domain-specific purposes.
Exactly this. A tool called qmd is what I use for the hybrid search portion. It also uses local LLMs to provide summaries of your own Markdown data. My agents use both depending on what type of search they are doing, and both provide good results.