by gertlabs
11 days ago
DeepSeek v4 Pro struggles with a custom harness, and all the models ranked above it don't, so it gets downweighted in the agentic coding benchmarks (although it ranks better than Flash in one-shot problem solving: https://gertlabs.com/rankings?ow=1&mode=oneshot_coding). We ran plenty of samples. MiMo v2.5 is on there, as well as the Pro version. We found a few anomalies in our evaluations, which makes sense -- if every new sub-release were better across the board in every area of the model card, that should raise alarms about benchmaxxing. But the main thing we found is that hype != performance, and I trust our benchmark methodology significantly more than the model cards the labs add to their press releases.
Flash handles it fine, which I found amusing. (Since MiMo is supposed to be Opus-level!) But Flash seems to work even better in Claude Code...
With smaller models I always have the issue of needing to adapt myself to their preferred workflow... which sort of defeats the purpose. Price is hard to beat tho :)