|
They all have. I don't hope to convince you of that, everyones use case differs. Generally, AIME / prose / code benchmarks that don't involve successive tool calls are used to hide some very dark realities. IMHO tool calling is by far the most clearly economically valuable function for an LLM, and r1 self-admittedly just...couldn't do it. There's a lot of puff out there that's just completely misaligned with reality, ex. Gemini 2.5 Pro is by far the worst tool caller, Gemini 2.5 Flash thinking is better, 2.5 Flash is even better. And either Llama 4 beats all Gemini 2.5s except 2.5 Flash not thinking. I'm all for "these differences will net out in the long run", Google's at least figured out how to micro optimize for Aider edit formatting without tools. Over the last 3 months, they're up 10% on edit performance. But it's horrible UX to have these specially formatted code blocks in the middle of prose. They desperately need to clean up their absurd tool-calling system. But I've been saying that for a year now. And they don't take it seriously, at all. One of their most visible leads tweeted "hey what are the best edit formats?" and a day later is tweeting the official guide for doing edits. I'm a Xoogler and that absolutely reeks of BigCo dysfunction - someone realized a problem 2 months after release and now we have "fixed" it without training, and now that's the right way to do things. Because if it isn't, well, what would we do? Shrugs I'm also unsure how much longer it's worth giving a pass on this stuff. Everyone is competing on agentic stuff because that's the golden goose, real automation, and that needs tools. It would be utterly unsurprising to me for Google to keep missing a pain signal on this, vis a vis Anthropic, which doubled down on it mid-2024. As long as I'm dumping info, BFCL is not a good proxy for this quality. Think "converts prose to JSON" not "file reading and editing" |
I am not really invested in this niche topic but I will observe that, yes I agree Llama 4 is really good here. And yet it's a far worse coder, far less intelligent than DeepSeek and that's not even arguable. So no it didn't "catch up" any more than what you could say by pointing out Llama is multimodal but DeepSeek isn't. That's just talking about a different things entirely.
Regardless, I do agree BFCL is not the best measure either, the Tau-bench is more real world relevant. But end of the day, most frontier labs are not incentive aligned to care about this. Meta cares because this is something Zuck personally cares about, Llama models are actually for small businesses solving grunt automation, not for random people coding at home. People like Salesforce care (xLAM), even China had GLM before DeepSeek was a thing. DeepSeek might care so long as it looks good for coding benchmarks, but that's pretty much the extent of it.
And I suspect Google doesn't truly care because in the long run they want to build everything themselves. They already have a CodeAssist product around coding which likely uses fine-tune of their mainline Gemini models to do something even more specific to their plugin.
There is a possibility that at the frontier, models are struggling to be better in a specific and constrained way, without getting worse at other things. It's either this, or even Anthropic has gone rogue because their Aider scores are way down now from before. How does that make sense if they are supposed to be all around better at agentic stuff in tool agnostic way? Then you realise they now have Claude Coder and it just makes way more economic sense to tie yourself to that, be context inefficient to your heart's content so that you can burn tokens instead of being, you know, just generally better.