|
|
|
|
|
by sudhirb
325 days ago
|
|
For coding agents, evaluations are tricky - thorough evaluation tasks tend to be slow and/or expensive and/or display a high degree of variance over N attempts. You could run a whole benchmark like SWE Bench or Terminal Bench against a coding agent on every change but it quickly becomes infeasible. |
|
Interestingly enough, we started with hundreds of evals, but after that experience my advice has become: less evals tied more closely to specific features and product ambitions.
By that I mean: some evals should serve as a warning ("uh oh, that eval failed, don't push to prod"), others as a mile stone ("woohoo! we got it work!"), and all should be informed by the product road map. You basically should understand where the product is going just by looking over the eval suite.
And, if you don't have evals, you really don't know if you're moving the needle at all. There were multiple situations where a tweak to a prompt passed an initial vibe check, but when run against the full eval suite, clearly performed worse.
The other piece of advice would be: evals don't have to sophisticated, just repeatable and agnostic to who's running them. Heck even "vibe checks" can be good evals, if they're written down and they need to pass some consensus among multiple people around whether they passed or not.