


	by Lienetic 743 days ago
	Where can I learn more detail about the metrics you support and how they work? I tried multiple other solutions but kept running into the problem that occasionally the framework would give me some score/evaluation of an LLM response that didn't make any sense, and there was minimal information about how it came up with the score. Often, I'd end up digging into the implementation of the framework to find the underlying evaluation prompt or classifier only to realize that the metric name is confusing or results are low confidence. I'm more cautious about using these tools now and look more deeply at how they work so that I can assess grading quality before relying on them to identify problematic outputs (e.g. hallucinations).

2 comments

resiros 743 days ago

I think the issue is that many of these metrics (e.g. RAGAS) are LLM-as-a-judge metrics. These are very far from reliable, and making them reliable is still a research problem. I've seen a couple of startups training their own LLM judge models to solve this. There is also some work that tries to improve reliability through sampling, such as G-Eval (https://github.com/nlpyang/geval).
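A minimal sketch of the sampling idea, not G-Eval's actual implementation: call the judge several times on the same response and look at both the mean and the spread of the scores. The `judge` callable, the 1-5 scale, and `make_fake_judge` (a random stand-in for a real LLM call) are all assumptions made for illustration.

```python
import random
import statistics
from typing import Callable, Tuple

# Hypothetical judge signature: (question, answer) -> integer score on a 1-5 scale.
Judge = Callable[[str, str], int]


def sampled_judge_score(judge: Judge, question: str, answer: str,
                        n_samples: int = 20) -> Tuple[float, float]:
    """Call a noisy judge n_samples times; return (mean score, standard deviation)."""
    scores = [judge(question, answer) for _ in range(n_samples)]
    return statistics.mean(scores), statistics.pstdev(scores)


def make_fake_judge(seed: int) -> Judge:
    """Stand-in for a real LLM call: returns a noisy score between 3 and 5."""
    rng = random.Random(seed)

    def judge(question: str, answer: str) -> int:
        return rng.choice([3, 4, 4, 5])

    return judge


mean_score, spread = sampled_judge_score(make_fake_judge(0), "What is 2+2?", "4")
print(f"mean={mean_score:.2f} spread={spread:.2f}")
```

A large spread is itself a useful signal: it suggests the judge is unsure about that response, so a single sampled score for it should not be trusted much.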

One needs to think of these metrics as a way to filter all the data to find potential issues, not as a final evaluation criterion. The gold standard should be human evaluation.
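The filtering idea above could look roughly like this: use the automated score only to decide which responses a human should look at first. `ScoredResponse`, the 0-1 score scale, and the 0.5 threshold are hypothetical choices for the sketch, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredResponse:
    response_id: str
    score: float  # automated metric score; assumed here to be 0-1, higher is better


def select_for_human_review(items: List[ScoredResponse],
                            threshold: float = 0.5) -> List[ScoredResponse]:
    """Return the items scoring below the threshold, worst first."""
    flagged = [item for item in items if item.score < threshold]
    return sorted(flagged, key=lambda item: item.score)


batch = [
    ScoredResponse("a", 0.9),
    ScoredResponse("b", 0.2),
    ScoredResponse("c", 0.45),
]
queue = select_for_human_review(batch)
print([item.response_id for item in queue])
```

The metric never produces the final verdict here. It only orders the review queue, so a noisy score costs some reviewer time rather than producing a wrong conclusion.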


Lienetic 743 days ago

Are there any approaches today that you've found are at least mostly reliable? Bonus points if it is somewhat clear/easy/predictable to know when it isn't or won't be.

We use human evaluation but that is naturally far from scalable, which has especially been a problem when working on more complicated workflows/chains where changes can have a cascading effect. I've been encouraging a lot of dev experimentation on my team but would like to get a more consistent eval approach so we can evaluate and discuss changes with more grounded results. If all of these metrics are low confidence, they become counterproductive since people easily fall into the trap of optimizing the metric.


nirga 743 days ago

I tend to find classic NLP metrics more predictable and stable than "LLM as a judge" metrics, so I'd try to see whether you can rely on them more.

We've written a couple of blog posts about some of them: https://www.traceloop.com/blog
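To illustrate why a classic metric is predictable, here is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplified stand-in written for this comment (whitespace tokenization, no stemming), not the implementation from any specific library or from the blog posts.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over overlapping lowercase whitespace tokens (ROUGE-1 style)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("the cat", "the cat sat"))
```

The same inputs always give the same score, and the formula makes it obvious where the metric fails: a correct paraphrase that shares few words with the reference scores low.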


swyx 743 days ago

for your blog can i offer a big downvote for the massive ai generated cover image thing? it's a trend for normies but for developers it's absolutely meaningless. give us info density pls


nirga 743 days ago

roger that! I like them though (am I a normie then?)


nirga 743 days ago

We trained our own models for some of them, and we combined some well-known NLP metrics (like GRUEN [1]) to make this work.

You're right that it's hard to figure out how to "trust" these metrics. But you shouldn't look at them as a way to get an objective number about your app's performance. They're more of a way to detect deltas: regressions or changes in performance. When you get more alerts or more negative results, you can tell you're regressing; when you get fewer, you can tell you're improving. In my view this works for tools like RAGAS as well as for our own metrics.

[1] https://www.traceloop.com/blog/gruens-outstanding-performanc...
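A rough sketch of the delta idea described above, assuming each run produces one boolean per response (True means the metric flagged it as bad). The function names and the 2-point tolerance are hypothetical, and a real setup would also want enough samples for the difference to mean something.

```python
from typing import List, Tuple


def failure_rate(flags: List[bool]) -> float:
    """Fraction of evaluated responses that the metric flagged as bad."""
    return sum(flags) / len(flags) if flags else 0.0


def compare_runs(baseline_flags: List[bool], candidate_flags: List[bool],
                 tolerance: float = 0.02) -> Tuple[str, float]:
    """Compare flag rates between two runs; return (verdict, rate delta)."""
    delta = failure_rate(candidate_flags) - failure_rate(baseline_flags)
    if delta > tolerance:
        return "regression", delta
    if delta < -tolerance:
        return "improvement", delta
    return "no clear change", delta


# 2 of 10 responses flagged before the change, 4 of 10 after it.
baseline = [True, False, False, False, True, False, False, False, False, False]
candidate = [True, True, False, True, False, False, True, False, False, False]
verdict, delta = compare_runs(baseline, candidate)
print(verdict, round(delta, 2))
```

The absolute flag rate is never interpreted on its own here. Only the change between two runs scored by the same metric is, which is why a biased metric can still be useful.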
