Evaluations are crucial, but what should you eval on? (github.com)
2 points by draismaa 460 days ago
1 comment

LLM evaluations are tricky. You can measure accuracy, latency, cost, hallucinations, bias, and more, but what really matters for your app? Instead of relying on generic benchmarks, build your own evals focused on your use case, then bring those evals into real-time monitoring of your LLM app. We open-sourced LangWatch to help with this. How are you handling LLM evals in production?
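To make the "own evals, then monitoring" idea concrete, here is a rough sketch in plain Python. It does not use LangWatch's API. The support-bot rule, the banned phrases, and the names `no_refund_promise`, `run_offline`, and `monitor` are all made up for illustration. The point is only that one evaluator function can score an offline dataset and also check live responses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class EvalResult:
    name: str
    passed: bool
    detail: str


Evaluator = Callable[[str, str], EvalResult]


def no_refund_promise(question: str, answer: str) -> EvalResult:
    """Hypothetical use-case rule: a support bot must never promise a refund."""
    banned = ("guarantee a refund", "full refund")  # assumed phrases
    lowered = answer.lower()
    hit = next((phrase for phrase in banned if phrase in lowered), None)
    return EvalResult("no_refund_promise", hit is None, hit or "ok")


def run_offline(evaluator: Evaluator, dataset: Iterable[Tuple[str, str]]) -> float:
    """Return the pass rate of the evaluator over (question, answer) pairs."""
    results = [evaluator(q, a) for q, a in dataset]
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def monitor(
    evaluator: Evaluator,
    question: str,
    answer: str,
    alert: Callable[[str], None] = print,
) -> EvalResult:
    """Run the same evaluator on a live response and alert on failure."""
    result = evaluator(question, answer)
    if not result.passed:
        alert(f"[eval:{result.name}] failed: {result.detail}")
    return result
```

In practice `alert` would forward to whatever logging or monitoring tool you already use, and the string match would likely be replaced by something sturdier. Keeping the offline and online paths on the same function means the number you track in production is the same one you tested against.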