|
Earlier this week I started testing Chinese models on my codebase. I haven’t really looked at interactive coding yet, focusing instead on issue triage, bug auto-fixing, log analytics, etc. I used DeepSeek, Kimi, GLM, Qwen, and MiMO against GPT-5.5 high as reference, all running in the Pi harness without anything installed. So far, Kimi and MiMO look the most promising to me. I haven’t tested them rigorously enough to make a strong statement, but my first impression is that, in practice, all those models may be less behind on typical daily tasks than people think. They are a bit “work hard, not smart”: they get to same-ish results more slowly and use more tokens, but at a fraction of the price. |
Based on these benchmarks, here's a rough mapping:
- Qwen 3.7 ~= GPT 5.3
- Kimi K2.6 ~= GPT 5.15
- DS V4 ~= GPT 5.1
So yes, we have GPT 5 at home now. No need to pay the Legacy Labs anymore.
Here's the benchmark I used, linked since I can't post images here: https://x.com/trydotworks/status/2058004995195490706?s=20