|
I want to like composer, but I just can't. - Its communication style is completely opposite to Anthropic models. It's not as bad as OpenAI's models, which are obsessed with "shapes", "wrinkles", hyphenated-words, and other cryptic formulations that make you feel like you're not on planet earth after a while talking to them. But it is nonetheless markedly "rude", "dry", "cold", gives off this "entitled I'm right, you're wrong" attitude. I once had composer2-fast accidentally run `rm -rf $HOME` (no harm done) as part of a bug in an install script it wrote and all it could say once it realized it was: "Running script with proper hardening". Qwen's models have clearly been distilled from Anthropic models because they have a much closer communication style and that's why I hope cursor will one day release a new family of composer models derived from that. A damn joy to use. - It's just dumb. I don't know what they're doing with benchmarks, but for my work (python, bash, docker, whatever), cursor is just incredibly dumb. Always does in 10 lines what could be done in one. Doesn't know loads of internals of things that other models know. Never places things in the right files, constantly makes terrible edits (inline imports, edits without testing). Everything is so complicated when done by composer2, it's just a joke to me at this point. It clearly needs more handholding than Opus 4.x or GPT-5.x. I tried 2.5-fast and it seemed more of the same. And this would sort of be acceptable if it owned up to its incompetence, but it is so confidently incompetent that it's revolting. I know that for many people the "tone" of the models is not relevant, or maybe they even prefer models like these. I simply cannot work like that. Ever since Gemini started blowing benchmarks out of the water while being a clearly inferior model incapable of producing anything (and pretty much just doing tool calls without any feedback to the user), I gave up on benchmarks. Composer has been more of the same in that regard. As a GPT model would say: "Small wrinkle: the production-ready benchmark results were tainted by real-world data points. I've assimilated the inconsistencies and added guardrails so that v2 has the right shape for future evaluations."
|