| HN Mirror



	by PUSH_AX 35 days ago
	They set themselves up for flak when they use whatever these evals are… they did the same for Composer 2, which was evaluated as being in close competition with frontier models. Spoiler alert: it wasn’t even close in practice. So now 2.5 is supposed to compete with Opus 4.7? Sure…

4 comments

tuo-lei 35 days ago

they say it themselves in the post: behavior dimensions "not well captured by existing benchmarks". that was the exact problem with Composer 2. not dumber on individual tasks, just bad at session-level decisions like when to stop editing, how much context to carry forward, when to re-read a file vs. assume its contents. you don't catch any of that in an isolated eval.


jmcqk6 34 days ago

That does not match my experience. Composer 2 was fantastic for my uses, and I hit Composer 2.5 with some very difficult things last night, which it handled fast and effectively. I don't really care about benchmarks. I care about practice, and in practice, it's been very very good for me.


infecto 35 days ago

As I have said in prior Composer threads, the proof is in the usage. I am inclined to somewhat believe the results, since I use Composer, and I also read the results in the given context. It’s not a general-purpose SOTA model. It’s a model that runs inexpensively in their coding workflow and produces results similar to Opus or GPT.


criemen 35 days ago

Well, is that a statement about the quality of Opus 4.7 or about Composer 2.5? :P
