
I used to own the eval suite for a coding agent, and it's certainly doable, even when it requires SQL + tables etc. We supported a wide range of data options, from canned CSV data to plugging into prod to simulate the user experience, all easily configurable at eval run time. It also supported agentic flows where the results from one eval could be chained into the next (with the option of sending a known correct answer instead, to check the framework end to end in the case of node failure).
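To make the chaining idea concrete, here is a rough sketch, not the actual suite. It assumes one reading of the fallback: when a node fails, its known correct answer is passed downstream so the later nodes still get exercised. `EvalStep`, `run_chain`, and `substitute_on_failure` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class EvalStep:
    """One node in a chained eval (hypothetical structure)."""
    name: str
    run: Callable[[str], str]  # takes the upstream output, returns this node's output
    expected: str              # known correct answer for this node


def run_chain(
    steps: List[EvalStep],
    initial_input: str,
    substitute_on_failure: bool = True,
) -> List[Tuple[str, bool]]:
    """Run steps in order, feeding each output into the next step.

    If a step fails and substitute_on_failure is set, the known correct
    answer is passed downstream instead, so the rest of the chain still runs.
    """
    results: List[Tuple[str, bool]] = []
    current = initial_input
    for step in steps:
        output: Optional[str]
        try:
            output = step.run(current)
        except Exception:
            output = None  # a crash counts as a failed node
        passed = output == step.expected
        results.append((step.name, passed))
        if passed:
            current = step.expected
        elif substitute_on_failure:
            current = step.expected  # assumption: the fallback is the canned answer
        else:
            break  # without the fallback, downstream nodes cannot run
    return results
```

With the fallback on, a broken first node (say, SQL generation) doesn't hide whether the second node (say, result interpretation) still works; with it off, the chain simply stops at the first failure.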

Interestingly enough, we started with hundreds of evals, but after that experience my advice has become: fewer evals, tied more closely to specific features and product ambitions.

By that I mean: some evals should serve as a warning ("uh oh, that eval failed, don't push to prod"), others as a milestone ("woohoo! we got it to work!"), and all should be informed by the product roadmap. You basically should understand where the product is going just by looking over the eval suite.
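One way to encode that split, as a minimal sketch: tag each eval as a warning or a milestone and tie it to a feature, then let only failed warnings block a release. The names here (`Kind`, `EvalResult`, `release_decision`) and the example feature labels are hypothetical, not from any particular framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Union


class Kind(Enum):
    WARNING = "warning"      # a failure means: don't push to prod
    MILESTONE = "milestone"  # a pass means: a roadmap goal now works


@dataclass(frozen=True)
class EvalResult:
    name: str
    kind: Kind
    feature: str  # the roadmap feature this eval is tied to
    passed: bool


def release_decision(results: List[EvalResult]) -> Dict[str, Union[bool, List[str]]]:
    """Failed warnings block the release; milestones are only reported."""
    blockers = [r.name for r in results if r.kind is Kind.WARNING and not r.passed]
    reached = [r.name for r in results if r.kind is Kind.MILESTONE and r.passed]
    return {"ship": not blockers, "blockers": blockers, "milestones_reached": reached}
```

A failing milestone eval doesn't block anything here; it just means the feature isn't there yet.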

And, if you don't have evals, you really don't know if you're moving the needle at all. There were multiple situations where a tweak to a prompt passed an initial vibe check, but when run against the full eval suite, clearly performed worse.
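A minimal sketch of that comparison, assuming each run of the suite is summarized as a mapping from eval name to pass/fail (`compare_runs` and the result format are assumptions for illustration):

```python
from typing import Dict, List, Union


def compare_runs(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
) -> Dict[str, Union[float, List[str]]]:
    """Compare two runs of the same suite, e.g. before and after a prompt tweak."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no evals in common")
    # Evals that passed before the change and fail after it.
    regressions = [n for n in shared if baseline[n] and not candidate[n]]
    # Evals that failed before the change and pass after it.
    improvements = [n for n in shared if not baseline[n] and candidate[n]]
    return {
        "baseline_pass_rate": sum(baseline[n] for n in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[n] for n in shared) / len(shared),
        "regressions": regressions,
        "improvements": improvements,
    }
```

The list of regressions is usually more useful than the aggregate pass rate, since a tweak can fix one eval and break another while the rate stays flat.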

The other piece of advice would be: evals don't have to be sophisticated, just repeatable and agnostic to who's running them. Heck, even "vibe checks" can be good evals, if they're written down and multiple people have to reach some consensus on whether they passed or not.
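Even the vibe-check version can be pinned down in a few lines. This sketch assumes "consensus" means a strict majority among a minimum number of reviewers; both the rule and the names (`VibeCheck`, `min_reviewers`) are assumptions, so pick whatever rule your team actually agrees on.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VibeCheck:
    """A written-down manual check that several people vote on."""
    name: str
    instructions: str  # what to try and what "good" looks like
    votes: Dict[str, bool] = field(default_factory=dict)  # reviewer -> passed?

    def record(self, reviewer: str, passed: bool) -> None:
        self.votes[reviewer] = passed

    def passes(self, min_reviewers: int = 3) -> bool:
        """Pass only if enough people voted and a strict majority said yes."""
        if len(self.votes) < min_reviewers:
            raise ValueError(
                f"need at least {min_reviewers} reviewers, got {len(self.votes)}"
            )
        approvals = sum(self.votes.values())
        return approvals * 2 > len(self.votes)
```

Because the instructions and votes are recorded, the check is repeatable and doesn't depend on who happens to run it.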