HN Mirror

by vanuatu 27 days ago

Out of curiosity, I examined the worst task:

https://deepswe.datacurve.ai/data/trials/quill-shared-toolba...

It seems GPT is failing here because of an environment issue (it could not connect to Chromium), even though its local unit tests passed. All the models failed 4/4, and when I checked Opus, it had run into the same problem.

I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived compared with what a typical user would ask their coding agent (such is the difficulty of benchmark construction).

1 comment

flakiness 26 days ago

> Prompts are shorter than SWE-Bench Pro's but still longer than how developers actually message agents. Behavioral verification needs some minimum specificity to know what surface to test against, which puts a floor on how terse a prompt can be before the test becomes ambiguous.

It'll be tricky to automate the verification with a vague prompt. In other words, the SWE's job these days is to be an intelligent verifier.
