by kelseyfrog
108 days ago
Automated benchmarking. We were lucky enough to have PMs create a set of questions; we did a round of generation and labeled each response pass/fail. From there we bootstrapped an AI-as-a-judge and approximately replicated those annotations. Now we can plug in new models or change prompts and pipelines while still approximating the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions it brings. We're able to confidently make changes without accidentally breaking something else. Overall a win, but it can get costly if the iteration count is high.
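A rough sketch of that loop, in case it helps. Everything named here is hypothetical: `judge` and `generate` stand in for whatever model calls you actually use, and the toy data is made up for illustration. The idea is to first check how often the judge agrees with the human pass/fail labels, then reuse the judge to score a changed pipeline on the same questions.

```python
from typing import Callable, Sequence

# Hypothetical signatures: a generator maps a question to a response,
# and a judge maps (question, response) to a pass/fail verdict.
Generate = Callable[[str], str]
Judge = Callable[[str, str], bool]


def agreement_rate(human: Sequence[bool], judged: Sequence[bool]) -> float:
    """Fraction of responses where the judge matches the human label."""
    if len(human) != len(judged) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judged)) / len(human)


def pass_rate(questions: Sequence[str], generate: Generate, judge: Judge) -> float:
    """Run a pipeline over the question set and return the judged pass rate."""
    if not questions:
        raise ValueError("need at least one question")
    verdicts = [judge(q, generate(q)) for q in questions]
    return sum(verdicts) / len(verdicts)


# Toy stand-ins (assumptions, not a real model or judge).
questions = ["q1", "q2", "q3", "q4"]
baseline_responses = {
    "q1": "good answer",
    "q2": "bad answer",
    "q3": "good one",
    "q4": "bad but acceptable",
}
human_labels = [True, False, True, True]  # the original pass/fail annotations


def toy_judge(question: str, response: str) -> bool:
    return "good" in response


def baseline(question: str) -> str:
    return baseline_responses[question]


def candidate(question: str) -> str:
    return "good " + question


# Step 1: how well does the judge replicate the human labels?
judged = [toy_judge(q, baseline(q)) for q in questions]
judge_agreement = agreement_rate(human_labels, judged)

# Step 2: compare pipelines using the judge as the feedback signal.
baseline_score = pass_rate(questions, baseline, toy_judge)
candidate_score = pass_rate(questions, candidate, toy_judge)
```

The cost issue shows up in step 2: every pipeline change means one generation call plus one judge call per question, so the bill scales with question count times iteration count.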