by vanuatu
27 days ago
Out of curiosity, I examined the worst task: https://deepswe.datacurve.ai/data/trials/quill-shared-toolba... It seems GPT is failing here because of an environment issue connecting to Chromium, even though its local unit tests passed. All the models failed 4/4, and when I checked Opus, it had run into the same problem. I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived compared with what a typical user would ask their coding agent (such is the difficulty of benchmark construction).
It'll be tricky to automate the verification with a vague prompt. In other words, the SWE's job these days is to be an intelligent verifier.