Evaluations are crucial, but what should you eval on?

LLM evaluations are tricky. You can measure accuracy, latency, cost, hallucinations, bias... but what really matters for your app? Instead of relying on generic benchmarks, build your own evals focused on your use case, then bring those same evals into real-time monitoring of your LLM app. We open-sourced LangWatch to help with this. How are you handling LLM evals in production?
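
To make "build your own evals, then reuse them for monitoring" a bit more concrete, here is a minimal sketch in plain Python. It does not use the LangWatch API; every name in it (`no_forbidden_phrases`, `run_offline`, `with_monitoring`, `fake_model`) is hypothetical, and the forbidden-phrase check is just one stand-in for whatever actually matters in your app.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]          # prompt -> output
Check = Callable[[str, str], float]   # (prompt, output) -> score in [0, 1]


def no_forbidden_phrases(forbidden: List[str]) -> Check:
    """Build a use-case-specific check: penalize outputs containing risky phrases."""
    def check(prompt: str, output: str) -> float:
        if not forbidden:
            return 1.0
        text = output.lower()
        hits = sum(1 for phrase in forbidden if phrase.lower() in text)
        return 1.0 - hits / len(forbidden)
    return check


def run_offline(model: Model, prompts: List[str], check: Check) -> float:
    """Offline eval: average score of the check over a fixed set of prompts."""
    scores = [check(p, model(p)) for p in prompts]
    return sum(scores) / len(scores)


def with_monitoring(
    model: Model,
    check: Check,
    threshold: float,
    flagged: List[Tuple[str, str, float]],
) -> Model:
    """Online use: run the same check on every live call and record low scores."""
    def wrapped(prompt: str) -> str:
        output = model(prompt)
        score = check(prompt, output)
        if score < threshold:
            flagged.append((prompt, output, score))
        return output
    return wrapped


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption, for illustration only)."""
    if "refund" in prompt.lower():
        return "We guarantee a refund within 24 hours."
    return "Happy to help with that."
```

The point of the design is that the check is written once and used twice: `run_offline(fake_model, prompts, check)` gives a score before you ship, and `with_monitoring(fake_model, check, 0.8, flagged)` applies the same check to production traffic, collecting anything that falls below the threshold in `flagged` for review.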