| HN Mirror

by nutellalover 659 days ago

Happy to help!

At the beginning, we relied on qualitative "vibe" checks: they let us iterate quickly, and the quality deltas were still large enough that we could plainly see which approach performed better.

Once we stopped trusting our ability to discern differences, we actually bit the bullet and made a small eval benchmark set (~20 queries across 3 repos of different sizes) and then used that to guide algorithmic development.
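
A benchmark of that size can be very little code. The sketch below is illustrative only and is not the harness described above: it assumes each eval case pairs a query with the file paths a good result set should contain, and it scores a retriever by mean recall@k per repo. The names `EvalCase`, `recall_at_k`, and `run_benchmark`, the choice of recall@k as the metric, and the `retrieve(repo, query)` signature are all assumptions made for the example.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    """One benchmark entry (hypothetical schema)."""
    repo: str                  # which repo the query targets
    query: str                 # natural-language question about the code
    expected: frozenset[str]   # file paths a good result set should include


def recall_at_k(retrieved: list[str], expected: frozenset[str], k: int) -> float:
    """Fraction of the expected paths found in the top-k retrieved paths."""
    if not expected:
        return 0.0
    return len(set(retrieved[:k]) & expected) / len(expected)


def run_benchmark(
    cases: Iterable[EvalCase],
    retrieve: Callable[[str, str], list[str]],  # (repo, query) -> ranked paths
    k: int = 5,
) -> dict[str, float]:
    """Return mean recall@k for each repo in the benchmark set."""
    scores: dict[str, list[float]] = defaultdict(list)
    for case in cases:
        ranked = retrieve(case.repo, case.query)
        scores[case.repo].append(recall_at_k(ranked, case.expected, k))
    return {repo: sum(vals) / len(vals) for repo, vals in scores.items()}
```

Running `run_benchmark` with the same cases against two retrieval variants gives a per-repo number to compare, which is the kind of signal that takes over once eyeballing stops being reliable. With only ~20 queries the scores are noisy, so small differences should be read with caution.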

1 comment

peterldowns 659 days ago

Thank you, I appreciate the details.
