| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jangletown 356 days ago

That's true, we have been trying to help customers doing evals for ages now, and it's super hard for everyone to build a really good dataset and define great quality metrics

just wanted then to shameless plug this lib I've built recently for this very topic, because it's been much easier to sell that into our clients than evals really, because it's closer to e2e tests: https://github.com/langwatch/scenario

instead of 100 examples, it's easier for people to think on just the anecdotal example where the problem happens and let AI expand it, or replicate a situation from prod and describe the criteria in simple terms or code