by natrys
367 days ago
Not disagreeing with the overarching point, but:
> That was the easy part
is a bit hand-wavy, in that it doesn't explain why only DeepSeek can do this "easy" thing, while Meta, Mistral, or really anyone else still can't. Many other players have way more compute than DeepSeek (even inside China, never mind the rest of the world), and I can assure you more or less everyone trains on synthetic data/distillation from whatever bigger model they can access.
IMHO tool calling is by far the most clearly economically valuable function for an LLM, and R1, by its own admission, just... couldn't do it.
There's a lot of puff out there that's completely misaligned with reality, e.g. Gemini 2.5 Pro is by far the worst tool caller, Gemini 2.5 Flash thinking is better, and 2.5 Flash is better still. And either Llama 4 model beats every Gemini 2.5 except 2.5 Flash non-thinking.
I'm all for "these differences will net out in the long run", and Google has at least figured out how to micro-optimize for Aider edit formatting without tools. Over the last 3 months, they're up 10% on edit performance. But it's horrible UX to have these specially formatted code blocks in the middle of prose. They desperately need to clean up their absurd tool-calling system. But I've been saying that for a year now, and they don't take it seriously at all. One of their most visible leads tweeted "hey what are the best edit formats?" and a day later was tweeting the official guide for doing edits. I'm a Xoogler and that absolutely reeks of BigCo dysfunction: someone realized there was a problem 2 months after release, so now it's been "fixed" without training, and now that's the right way to do things. Because if it isn't, well, what would we do? *shrugs*
I'm also unsure how much longer it's worth giving them a pass on this stuff. Everyone is competing on agentic stuff because that's the golden goose, real automation, and that needs tools. It would be utterly unsurprising to me for Google to keep missing the pain signal on this, vis-à-vis Anthropic, which doubled down on it in mid-2024.
As long as I'm dumping info: BFCL is not a good proxy for this quality. Think "converts prose to JSON", not "file reading and editing".
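To make that last distinction concrete, here's a rough sketch of the two styles as I'd caricature them. Everything in it is made up for illustration: `grade_single_call`, `read_file`, `edit_file`, and `run_tool_calls` are hypothetical names, and this is not how BFCL or any real agent harness is implemented. The first function grades a single emitted JSON call by matching; the rest is a stateful read-then-edit sequence where each call has side effects and a bad argument returns an error the model would have to recover from.

```python
import json
from pathlib import Path


# Style 1: "converts prose to JSON". One request, one structured call,
# graded purely by comparing against an expected call. (Hypothetical grader.)
def grade_single_call(model_output: str, expected: dict) -> bool:
    """Return True if the model's output parses to exactly the expected call."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return call == expected


# Style 2: "file reading and editing". Tools touch real state, and whether
# an edit succeeds depends on what the file actually contains.
def read_file(path: str) -> str:
    return Path(path).read_text()


def edit_file(path: str, old: str, new: str) -> str:
    """Replace `old` with `new`, but only if `old` matches exactly once."""
    text = Path(path).read_text()
    matches = text.count(old)
    if matches != 1:
        return f"error: expected exactly one match, found {matches}"
    Path(path).write_text(text.replace(old, new))
    return "ok"


# Hypothetical tool registry; a real harness would also expose schemas.
TOOLS = {"read_file": read_file, "edit_file": edit_file}


def run_tool_calls(calls: list[dict]) -> list[str]:
    """Execute a sequence of tool calls in order and collect each result."""
    results = []
    for call in calls:
        results.append(TOOLS[call["name"]](**call["arguments"]))
    return results
```

A model can ace the first kind of check and still flail at the second, because the second only works if it reads before editing, quotes the file contents exactly, and reacts sensibly when `edit_file` comes back with an error instead of `ok`.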