by flakiness 19 days ago
> Prompts are shorter than SWE-Bench Pro's but still longer than how developers actually message agents. Behavioral verification needs some minimum specificity to know what surface to test against, which puts a floor on how terse a prompt can be before the test becomes ambiguous.

It'll be tricky to automate verification against a vague prompt. In other words, the SWE's job these days is to be an intelligent verifier.