by wunderlotus 35 days ago
I love Cursor as a tool, but I'm skeptical because:

1/ CursorBench is so opaque [1] that it's hard to trust. On top of that, the v3.1 eval is a newer iteration, and there's no insight into the tasks or whether the model was just tuned to max it out. Composer 2 scored 60-65% on the previous benchmark eval [2] but scores 50-55% on CB v3.1 [3].

2/ I've used Composer 2, and its performance leaves much to be desired as a daily driver for a knowledge worker. But KWs are obviously not the target users, and I can see how it's cost-efficient for executing clearly defined, discrete coding tasks. Obviously that's their value proposition, and they're still figuring out how to communicate it well to the target customer. It just doesn't feel like CursorBench is the way to do that.

[1] https://cursor.com/blog/cursorbench#building-cursorbench

[2] https://cursor.com/blog/composer-2-technical-report#performa...

[3] https://cursor.com/blog/composer-2-5