| I wish there were a new kind of benchmark that...wasn't focused on prompt-to-finished-task completion, but rather on how well a model can act as an assistant. At my day job, despite all the harnessing and the extensive documentation and user stories we provide via E2Es, I cannot trust models to deliver quality output. They are unable to, and reviewing 18 files of changes is the kind of work that increases my load and effort. And yes, we have already split and optimized our documentation so that it does not overwhelm the context. To get quality output, the best flow is planning together, finding edge cases, having review skills, iterating, producing a business-logic-focused document describing the changes -> iterating to get a code-changeset-focused document. Then I want to review, step by step, all the edits the model makes. On average this triples the time required for a major change, but it significantly improves business logic correctness and code quality. The major benefit is that the code will require significantly less maintenance down the line, so it pays off twice: once for the codebase, and once for the harness (higher-quality code, proper information, better examples for the models in the future). The issue is: models are getting increasingly worse at this kind of work. While it is clear that they have better capabilities, the feedback loop has definitely degraded between Opus 4.7 and Opus 4.8, much more than it did between Opus 4.5 and 4.7. This is very disappointing to me, as it is crystal clear that models are increasingly reinforced to go from prompt to end result on their own, leaving me more and more out of the loop. This has resulted in increasing frustration and makes my work slower, not better. |