Hacker News
by kelseyfrog 108 days ago
Automated benchmarking.

We were lucky enough to have PMs create a set of questions. We did a round of generation and labeled each response with a pass/fail annotation.

From there we bootstrapped an AI-as-a-judge and approximately replicated the human results. Now we can plug in new models or change prompts and pipelines while still approximating the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions that brings.

We're able to confidently make changes without accidentally breaking something else. Overall win, but it can get costly if the iteration count is high.
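To make the "approximately replicated" step concrete, here is a minimal sketch of how judge verdicts could be checked against the human pass/fail labels, and how a new model's responses could then be scored with the same judge. Everything in it is an assumption for illustration: `toy_judge` is a keyword stand-in for a real model call, and the sample responses and labels are made up, not the actual data or setup described above.

```python
from typing import Callable, Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of responses where the judge's verdict matches the human label."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    if not human:
        raise ValueError("need at least one labeled response")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def pass_rate(responses: Sequence[str], judge: Callable[[str], bool]) -> float:
    """Fraction of responses the judge marks as passing."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(judge(r) for r in responses) / len(responses)


def toy_judge(response: str) -> bool:
    # Hypothetical stand-in: a real judge would prompt a model with the
    # question, the response, and the grading rubric.
    return "refund" in response.lower()


# Made-up baseline responses with made-up human pass/fail labels.
baseline = [
    "We issue a refund within 5 days.",
    "Please contact support.",
    "Refund approved.",
    "Your order ships tomorrow.",
]
human_labels = [True, False, True, True]

# Step 1: check how closely the judge reproduces the human annotations.
judge_labels = [toy_judge(r) for r in baseline]
agreement = agreement_rate(human_labels, judge_labels)

# Step 2: score a candidate model's responses with the same judge.
candidate = ["Refund issued.", "Refund is on the way.", "No idea.", "Refund denied."]
baseline_rate = pass_rate(baseline, toy_judge)
candidate_rate = pass_rate(candidate, toy_judge)
```

If agreement with the human labels is too low, the judge prompt probably needs more work before its pass rates can be trusted as a regression signal.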

1 comment

This is an interesting approach, thanks for the insight! If I may ask, _approximately_ how long does it take to test a newly released model with the current strategy?
20 mins or so. The bottleneck is rate limiting. It's amenable to parallelization: each test can run in isolation, at the same time as the others.
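As a rough illustration of running isolated tests in parallel while staying under a rate limit, here is a sketch using `asyncio` with a semaphore to cap in-flight requests. The names `run_one`, `run_suite`, and `max_in_flight` are hypothetical, and the `asyncio.sleep` call is a placeholder for the real model and judge requests. It's one possible shape, not the actual harness.

```python
import asyncio


async def run_one(question: str, limiter: asyncio.Semaphore) -> tuple[str, bool]:
    """Run a single test in isolation, holding one slot of the concurrency cap."""
    async with limiter:
        # Placeholder for the real work: generate a response, then judge it.
        await asyncio.sleep(0.01)
        return question, len(question) > 0


async def run_suite(questions: list[str], max_in_flight: int = 4) -> dict[str, bool]:
    """Run all tests concurrently, with at most max_in_flight active at once."""
    limiter = asyncio.Semaphore(max_in_flight)
    results = await asyncio.gather(*(run_one(q, limiter) for q in questions))
    return dict(results)


results = asyncio.run(run_suite(["q1", "q2", "q3"], max_in_flight=2))
```

Since the tests don't share state, raising `max_in_flight` up to whatever the provider's rate limit allows is the main knob for wall-clock time.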
Gotchu. Yeah that's pretty quick, awesome thanks!