by kelseyfrog 108 days ago

Automated benchmarking.

We were lucky enough to have PMs create a set of questions. We did a round of generation and labeled each response pass/fail.

From there we bootstrapped an AI-as-a-judge and approximately replicated the results. Now we plug in new models or change prompts and pipelines while still being able to approximate the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions it brings.
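A rough sketch of what "approximately replicated" could look like in code: compare the judge's pass/fail verdicts against the human labels and report the agreement rate. Everything here is illustrative and not the actual setup. `Example`, `judge_agreement`, and `toy_judge` are hypothetical names, and `toy_judge` is a keyword check standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    question: str
    response: str
    human_pass: bool  # pass/fail label from the human annotation round


def judge_agreement(
    examples: Sequence[Example],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    matches = sum(
        judge(ex.question, ex.response) == ex.human_pass for ex in examples
    )
    return matches / len(examples)


def toy_judge(question: str, response: str) -> bool:
    # Hypothetical stand-in for an LLM judge call: passes any response
    # that mentions refunds. A real judge would prompt a model instead.
    return "refund" in response.lower()


examples = [
    Example("How do I get a refund?", "Open Settings and request a refund.", True),
    Example("How do I get a refund?", "I cannot help with that.", False),
    Example("What is the refund window?", "Refunds are available for 30 days.", True),
    Example("How do I reset my password?", "Click 'Forgot password'.", True),
]

agreement = judge_agreement(examples, toy_judge)
```

Once agreement on the labeled set is acceptable, the same judge can score responses from a new model or prompt, and a drop in pass rate flags a likely regression.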

We're able to confidently make changes without accidentally breaking something else. Overall win, but it can get costly if the iteration count is high.


yelmahallawy 107 days ago

This is an interesting approach, thanks for the insight! If I may ask, _approximately_ how long does it take to test a newly-released model with the current strategy?


kelseyfrog 106 days ago

20 minutes or so. The bottleneck is rate-limiting. It's amenable to parallelization: each test can run in isolation at the same time.
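A minimal sketch of that idea, assuming each test is an independent async call: run them all concurrently, but cap the number in flight so the provider's rate limit isn't hit. `run_all`, `fake_case`, and `max_concurrency` are hypothetical names, and `fake_case` just sleeps in place of a real generate-and-judge round trip.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def run_all(
    cases: Sequence[str],
    run_case: Callable[[str], Awaitable[str]],
    max_concurrency: int = 8,
) -> list[str]:
    """Run every case concurrently, with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(case: str) -> str:
        async with sem:  # wait here while the cap is reached
            return await run_case(case)

    # gather returns results in the same order as the input cases
    return await asyncio.gather(*(guarded(c) for c in cases))


async def fake_case(case: str) -> str:
    # Hypothetical stand-in for one test: generate a response, then judge it.
    await asyncio.sleep(0.01)
    return case.upper()


if __name__ == "__main__":
    print(asyncio.run(run_all(["a", "b", "c"], fake_case, max_concurrency=2)))
```

The cap is the knob to tune: too high and requests get throttled, too low and the run takes longer than it needs to.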


yelmahallawy 106 days ago

Gotchu. Yeah that's pretty quick, awesome thanks!
