|
My belief is that software engineering benchmarks are still a poor proxy for performance on real-world software engineering tasks, and that there's a decent chance a new model might saturate a benchmark while still being kind of underwhelming. A simple example: if a human scored 50% on SWE-bench Verified, it's fair to say that this person is a very competent software engineer. Popular frontier models like Claude Sonnet and OpenAI's o3 can score 50% on SWE-bench, and can score even higher with special tooling, but compared to an actual human software engineer they can't seem to competently perform a lot of programming tasks on their own. If a model did consistently score more than 99% on various software engineering benchmarks, that might be different, as it would imply a very real sense of competence. That's a pretty substantial if, though. To my knowledge there isn't a single model out there that can consistently score more than 99% on any given benchmark. The o1 model scored very well on certain MMLU categories, 98.1% on college mathematics, but I'm not sure this result would hold on a similar benchmark evaluating undergraduate-level mathematics. Something else to consider: we take for granted how often we're able to perform tasks with more than 99% accuracy, and how quickly things would fall apart if this weren't the case. If the average human driver were able to make an accident-free trip only 99% of the time, that would imply they'd get in a wreck about once every 100 trips on average. Granted, software engineering might be the exception to this rule, but then again, it depends on what you're measuring. When it comes to more-or-less discrete steps, we're arguably pretty good at writing programs that capture our intent, and I could foresee an AI that only gets this right 99% of the time being a pain in the ass to work with. 
If a feature ticket requires 10 different sub-tasks to be done correctly, then an AI that can do each sub-task correctly 99% of the time has a roughly 90% chance of doing the whole feature ticket correctly (0.99^10 ≈ 0.904, assuming the sub-tasks succeed or fail independently of each other), which is still good but compounded over many feature tickets could be exhausting to deal with. An AI that has only a 90% chance of doing each sub-task correctly would get the whole ticket right only about 35% of the time (0.9^10 ≈ 0.349), so it would fail this hypothetical feature ticket more often than not. Mind you, statistics is not my domain so if there are any errors in my logic please correct me. |
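
For anyone who wants to check the compounding argument, here is a minimal sketch of the arithmetic. It assumes every sub-task succeeds or fails independently of the others, which is a simplification (real sub-tasks are often correlated), and the function name `ticket_success_probability` is made up purely for illustration.

```python
def ticket_success_probability(per_subtask_accuracy: float, num_subtasks: int) -> float:
    """Chance that every sub-task in a ticket succeeds.

    Assumption: sub-tasks are independent, so the probabilities multiply.
    """
    return per_subtask_accuracy ** num_subtasks


# Compare the two per-sub-task accuracies discussed above on a 10-step ticket.
for accuracy in (0.99, 0.90):
    p = ticket_success_probability(accuracy, 10)
    print(f"{accuracy:.0%} per sub-task -> {p:.1%} chance the whole ticket is correct")
```

Under that independence assumption, the 99% case comes out to roughly 90% per ticket and the 90% case to roughly 35%, so the weaker model fails the ticket more often than it succeeds.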