by visarga 736 days ago
It gets more interesting when you start benchmarking your prompts for accuracy. If you don't have an evaluation set, you are flying blind: any model update or small prompt fix could break edge cases without you noticing.
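To make that concrete, here is a minimal sketch of a prompt regression check in Python. Everything in it is an assumption for illustration: the `accuracy` and `regressions` helpers, the exact-match scoring, and the toy `old_model` / `new_model` functions, which stand in for real calls to an LLM with two prompt versions.

```python
from typing import Callable

# Hypothetical eval set: (prompt, expected answer) pairs.
EVAL_SET = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]


def accuracy(model: Callable[[str], str], eval_set: list[tuple[str, str]]) -> float:
    """Fraction of examples whose output exactly matches the expected answer."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    correct = sum(model(prompt).strip() == expected for prompt, expected in eval_set)
    return correct / len(eval_set)


def regressions(
    old: Callable[[str], str],
    new: Callable[[str], str],
    eval_set: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Examples the old version answered correctly and the new version gets wrong."""
    return [
        (prompt, expected)
        for prompt, expected in eval_set
        if old(prompt).strip() == expected and new(prompt).strip() != expected
    ]


# Toy stand-ins for a model before and after a prompt change (not a real API).
def old_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}.get(prompt, "")


def new_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "6"}.get(prompt, "")


if __name__ == "__main__":
    print("old accuracy:", accuracy(old_model, EVAL_SET))
    print("new accuracy:", accuracy(new_model, EVAL_SET))
    print("regressions:", regressions(old_model, new_model, EVAL_SET))
```

Exact match is the simplest possible scorer; for free-form outputs you would likely swap in something looser, but the per-example regression list is the part that catches broken edge cases.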
2 comments

We benchmark against our own eval sets, which makes it easier to measure the run-to-run variance that I’ve found impossible to eliminate.
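A rough sketch of how that variance could be measured, assuming you simply repeat the same eval several times and look at the spread of the scores. The helper names and the `make_noisy_model` stand-in (which fails at random to mimic a nondeterministic model) are hypothetical, not anyone's actual setup.

```python
import random
import statistics
from typing import Callable

EvalSet = list[tuple[str, str]]


def run_accuracy(model: Callable[[str], str], eval_set: EvalSet) -> float:
    """Exact-match accuracy for a single pass over the eval set."""
    return sum(model(p) == e for p, e in eval_set) / len(eval_set)


def accuracy_spread(
    model: Callable[[str], str], eval_set: EvalSet, runs: int = 5
) -> tuple[float, float]:
    """Mean and sample standard deviation of accuracy over repeated runs."""
    scores = [run_accuracy(model, eval_set) for _ in range(runs)]
    spread = statistics.stdev(scores) if runs > 1 else 0.0
    return statistics.mean(scores), spread


def make_noisy_model(
    answers: dict[str, str], error_rate: float, seed: int
) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampled LLM: returns an empty answer at random."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        if rng.random() < error_rate:
            return ""
        return answers.get(prompt, "")

    return model


if __name__ == "__main__":
    answers = {f"q{i}": f"a{i}" for i in range(50)}
    eval_set = list(answers.items())
    model = make_noisy_model(answers, error_rate=0.1, seed=0)
    mean, spread = accuracy_spread(model, eval_set, runs=10)
    print(f"accuracy {mean:.3f} +/- {spread:.3f}")
```

With a number like this in hand, a change between two prompt versions only counts as real if it is clearly larger than the spread between repeated runs of the same version.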
Make sure you don’t upload that evaluation set to any service that resells data (or gets scraped) for LLM training!