by nirga
694 days ago
We trained our own models for some of them, and we combined some well-known NLP metrics (like GRUEN [1]) to make this work. You're right that it's hard to figure out how to "trust" these metrics. But you shouldn't look at them as a way to get an objective number about your app's performance. They're more of a way to detect deltas: regressions or changes in performance. When you get more alerts or more negative results, you can tell you've regressed; when you get fewer, you can tell you're improving. And in my view this works for tools like RAGAS as well as for our own metrics.

[1] https://www.traceloop.com/blog/gruens-outstanding-performanc...
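
A minimal sketch of the delta idea, assuming each evaluated response gets a boolean "negative" flag from some metric. The function names and the 0.05 tolerance are hypothetical, chosen for illustration; this is not Traceloop's or RAGAS's API.

```python
def negative_rate(flags):
    """Fraction of evaluated responses flagged as negative (0.0 if empty)."""
    return sum(flags) / len(flags) if flags else 0.0


def detect_delta(baseline_flags, candidate_flags, tolerance=0.05):
    """Compare two runs by the change in negative rate, not the absolute value.

    tolerance is an assumed noise threshold; tune it for your own data.
    """
    delta = negative_rate(candidate_flags) - negative_rate(baseline_flags)
    if delta > tolerance:
        return "regression", delta
    if delta < -tolerance:
        return "improvement", delta
    return "no change", delta


# Hypothetical flags: True means the metric marked the response as negative.
baseline = [False] * 9 + [True] * 1
candidate = [False] * 7 + [True] * 3
label, delta = detect_delta(baseline, candidate)
print(label)
```

The absolute rates here mean little on their own; the useful signal is that the candidate run is flagged noticeably more often than the baseline.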