DeepSeek v4 Pro struggles with a custom harness, and all the models ranked above it don't, so it gets downweighted in the agentic coding benchmarks (although it ranks better than Flash in one-shot problem solving: https://gertlabs.com/rankings?ow=1&mode=oneshot_coding). We ran plenty of samples.
MiMo v2.5 is on there, as well as the pro version.
We found a few anomalies in our evaluations, which makes sense -- if every new sub-release is better across the board in every area of the model card, that should raise alarms about benchmaxxing. But the main thing we found is that hype != performance, and I trust our benchmark methodology significantly more than the model cards the labs add to their press releases.
Mimo struggles with my custom harness. (Ignores the instructions and defaults back to its own preferred tool calling syntax.)
Flash handles it fine, which I found amusing. (Since Mimo is supposed to be opus level!) But Flash seems to work even better in Claude Code...
With smaller models I always have the issue of needing to adapt myself to their preferred workflow... which sort of defeats the purpose. Price is hard to beat tho :)
Mimo v2.5 non-pro seems to do better with tool usage than its Pro sibling, is much cheaper and solves 90% of the same problems. I use Pro only for one-off tasks that require complex reasoning: memory management bugs, algorithms, planning.
When it gets stuck, I get one-shot advice from Claude or DS Pro. I’ve done massive amounts of work for cheap this way.
Thanks. I found One Weird Trick to make Mimo v2.5 Pro work in my harness, which is that I just added an example bash tool usage to the system prompt. Now it works fine.
The issue was that my previous instructions had <command> as a placeholder. But the model started wrapping bash commands in <command></command> tags... haha. Now that it has an actual example it just works properly.
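For anyone hitting the same thing, here is a rough sketch of the fix plus a defensive fallback. None of this is from my actual harness: the `bash:` reply format, both prompt strings, and `extract_bash_command` are made up for illustration. The idea is to show a literal example instead of a `<command>` placeholder, and to strip a stray `<command></command>` wrapper if the model emits one anyway.

```python
import re
from typing import Optional

# Hypothetical prompt fragments, not the real harness wording.
# A placeholder like <command> can be copied literally by the model.
PLACEHOLDER_PROMPT = "To run a shell command, reply with:\nbash: <command>"

# Same instruction, but with a concrete example instead of a placeholder.
EXAMPLE_PROMPT = (
    "To run a shell command, reply with a line starting with 'bash: '.\n"
    "Example:\n"
    "bash: ls -la src/"
)

# Matches a command that the model wrapped in literal <command> tags.
_WRAPPER = re.compile(r"^<command>(.*)</command>$", re.DOTALL)


def extract_bash_command(reply: str) -> Optional[str]:
    """Return the command from the first 'bash: ...' line, or None.

    Tolerates a model that wraps the command in <command></command>.
    """
    for line in reply.splitlines():
        line = line.strip()
        if not line.startswith("bash:"):
            continue
        cmd = line[len("bash:"):].strip()
        match = _WRAPPER.match(cmd)
        if match:
            cmd = match.group(1).strip()
        return cmd or None
    return None
```

With this, `extract_bash_command("bash: <command>ls -la</command>")` and `extract_bash_command("bash: ls -la")` both give back the bare command, so the harness keeps working even if the prompt fix doesn't fully stick.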
It's likely overfit to common harnesses and iteration patterns, so it struggles with formatting tool calls and JSON in our testing, which uses our own harnesses (although there is a lot of overlap with tools that would be found in any coding harness, like bash, apply_patch, etc.).
We didn't love the results because they draw negative scrutiny to our benchmark, but the results are real and were produced at scale. I think DeepSeek V4 Pro's inability to do agentic work outside of environments it was trained on is an important thing to measure, especially when so many other models can generalize to new environments just fine.
Google models also struggle with tools, but they have very strong initial answers, so there is more potential for them to bridge the gap with some better post-training.