by resiros
696 days ago
I think the issue is that many of these metrics (e.g. RAGAS) are LLM-as-a-judge metrics, and those are far from reliable. Making them reliable is still a research problem. I've seen a couple of startups training their own LLM judge models to solve it. There is also some work that attempts to improve reliability through sampling, such as G-Eval (https://github.com/nlpyang/geval). One needs to think of these metrics as a way to filter the data and surface potential issues, not as a final evaluation criterion. The gold standard should be human evaluators.
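To make the "filter, not final verdict" idea concrete, here is a rough sketch of what that could look like. It is not G-Eval's actual implementation, just plain repeated sampling: call the judge several times per item, then send anything with a low average score or a large disagreement between samples to a human. The names `sampled_score`, `flag_for_review`, `stub_judge`, and the thresholds `min_mean` and `max_spread` are all made up for illustration, and the judge is assumed to return a score in [0, 1].

```python
import random
import statistics
from typing import Callable, List, Tuple

# Hypothetical judge signature: (question, answer) -> score in [0, 1].
Judge = Callable[[str, str], float]


def sampled_score(judge: Judge, question: str, answer: str,
                  n_samples: int = 5) -> Tuple[float, float]:
    """Call the judge n_samples times; return (mean, population std dev)."""
    scores = [judge(question, answer) for _ in range(n_samples)]
    return statistics.mean(scores), statistics.pstdev(scores)


def flag_for_review(items: List[Tuple[str, str]], judge: Judge,
                    n_samples: int = 5, min_mean: float = 0.6,
                    max_spread: float = 0.2) -> List[Tuple[str, str]]:
    """Return items a human should look at.

    An item is flagged when the judge scores it low on average, or when
    the judge disagrees with itself across samples. The thresholds are
    illustrative assumptions, not recommended values.
    """
    flagged = []
    for question, answer in items:
        mean, spread = sampled_score(judge, question, answer, n_samples)
        if mean < min_mean or spread > max_spread:
            flagged.append((question, answer))
    return flagged


def stub_judge(question: str, answer: str) -> float:
    """Stand-in for a real LLM call: a noisy score, clamped to [0, 1]."""
    base = 0.9 if answer.strip() else 0.1
    return min(1.0, max(0.0, base + random.uniform(-0.05, 0.05)))


if __name__ == "__main__":
    data = [("What is 2 + 2?", "4"), ("What is the capital of France?", "")]
    for question, answer in flag_for_review(data, stub_judge):
        print("needs human review:", repr(question), repr(answer))
```

The point of the design is that the output is a review queue rather than a score to report, so humans only look at the subset the judge is unhappy or unsure about.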
We use human evaluation, but that naturally doesn't scale, which has been a particular problem on more complicated workflows/chains where a change can have cascading effects. I've been encouraging a lot of dev experimentation on my team, but I'd like a more consistent eval approach so we can evaluate and discuss changes with more grounded results. If all of these metrics are low-confidence, they become counterproductive, since people easily fall into the trap of optimizing the metric.