by bisonbear
96 days ago
I agree with your analysis but not the conclusion. Evals are broken: OpenAI showed that SWE-bench Verified was in the training data, and models were able to reconstruct the changes from memory (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...). However, this doesn't mean we should give up on benchmarking completely. In fact, as models get more intelligent and we give them more autonomy, I believe that tracking how well an agent aligns with your coding standards becomes even more important. What I've been exploring is a benchmark that is unique per repo, answering the question: how does the coding agent perform in my repo, doing my tasks, with my context? We no longer have to trust general benchmarks. Of course there will still be difficulties and limitations, but it's a step towards giving devs more information about agent performance and letting them use that information to tweak and optimize the agent further.
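To make the per-repo idea concrete, here is a minimal sketch of what such a harness could look like. Everything in it is hypothetical: `RepoTask`, `run_benchmark`, `pass_rate`, and the toy agent are illustrative names, not an existing tool, and a real setup would derive tasks from the repo's own history and check results with its test suite or linters rather than string checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RepoTask:
    """One task specific to a repo (hypothetical structure)."""
    name: str
    prompt: str
    # Returns True if the agent's output meets this repo's standard.
    check: Callable[[str], bool]


def run_benchmark(agent: Callable[[str], str], tasks: list[RepoTask]) -> dict[str, bool]:
    """Run the agent on every task and record pass/fail per task."""
    return {task.name: task.check(agent(task.prompt)) for task in tasks}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks passed; 0.0 when there are no tasks."""
    return sum(results.values()) / len(results) if results else 0.0


# Toy tasks standing in for repo-specific coding standards (assumed examples).
tasks = [
    RepoTask(
        name="uses-snake-case",
        prompt="Name a helper that parses dates.",
        check=lambda out: out.islower() and " " not in out,
    ),
    RepoTask(
        name="no-print-debugging",
        prompt="Log a warning.",
        check=lambda out: "print(" not in out,
    ),
]


def toy_agent(prompt: str) -> str:
    """Stand-in for a real coding agent; returns canned answers."""
    return "parse_date" if "dates" in prompt else "print('warn')"


results = run_benchmark(toy_agent, tasks)
```

Tracking `pass_rate(results)` across agent configurations is one way the results could feed back into tweaking the agent, since the tasks and checks come from your repo rather than a public dataset.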