by flakiness
19 days ago
|
|
> Prompts are shorter than SWE-Bench Pro's but still longer than how developers actually message agents. Behavioral verification needs some minimum specificity to know what surface to test against, which puts a floor on how terse a prompt can be before the test becomes ambiguous. It'll be tricky to automate the verification with a vague prompt. In other words, the SWE's job these days is to be an intelligent verifier.