Hacker News
by rwojo 968 days ago
This package suggests building a dataset and then using LLM-assisted evaluation via GPT-3.5/4 to evaluate your RAG pipeline on the dataset. It relies heavily on GPT-4 (or an equivalent model) to provide realistic scores. How safe is that approach?
2 comments

Using LLMs to evaluate other LLMs sounds like it would be dumb, but LLMs work in mysterious ways, and I’ve found this approach useful. In the context of RAG, using an LLM to evaluate whether a context chunk is relevant to answering a question is a nice complement to vector-embedding similarity search. Sometimes prompting the LLM gives better results than vector similarity.
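To make the idea concrete, here is a rough sketch of an LLM relevance check layered on top of vector search. Everything in it is an assumption for illustration: the prompt wording, the `filter_relevant_chunks` name, and the `judge` callable, which stands in for whatever model call you use. The stub judge below does not call any real API, and the chunks and similarity scores are made up.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt wording; tune it for the model you actually use.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Context chunk: {chunk}\n"
    "Is this chunk relevant for answering the question? Answer yes or no."
)


def filter_relevant_chunks(
    question: str,
    scored_chunks: List[Tuple[str, float]],
    judge: Callable[[str], str],
) -> List[Tuple[str, float]]:
    """Keep only chunks the judge calls relevant, ordered by vector similarity.

    scored_chunks holds (chunk_text, similarity) pairs from the vector search.
    judge is any function that takes a prompt and returns the model's reply.
    """
    kept = []
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    for chunk, similarity in ranked:
        prompt = PROMPT_TEMPLATE.format(question=question, chunk=chunk)
        verdict = judge(prompt).strip().lower()
        if verdict.startswith("yes"):
            kept.append((chunk, similarity))
    return kept


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "yes" if "Paris" in prompt else "no"


# Made-up retrieval results: the top-scoring chunk is on topic but unhelpful.
candidates = [
    ("France is known for its cheese.", 0.78),
    ("Paris is the capital of France.", 0.71),
    ("Berlin is the capital of Germany.", 0.69),
]
kept = filter_relevant_chunks("What is the capital of France?", candidates, stub_judge)
print(kept)
```

With a real model behind `judge`, the same loop would drop chunks that score well on similarity but don't actually help answer the question.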
Joe here. It's difficult to evaluate natural language responses that come from LLM applications: there are no hard metrics to measure performance like there are in, say, supervised machine learning tasks. For RAG, you have the response to evaluate as well as the retrieved context chunks. We found that using GPT-4 as an evaluator to measure the quality of RAG responses and the relevance of the context chunks gave similar results to having human evaluators at Tonic do the same task. Some research also finds that using LLMs as evaluators for natural language tasks gives results similar to those of human evaluators: https://arxiv.org/abs/2306.05685

As for whether using GPT-4 is a safe approach, the best you could ask for is that GPT-4's evaluations match those of human evaluators, and that's what both we and this research have found.
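If you want to run that comparison on your own data, one simple way is to score the same responses with both the LLM and human evaluators and measure how often they agree. The sketch below is only an illustration: the function name, the 1-5 scale, and the scores are invented for the example and are not results from Tonic or from the paper.

```python
from typing import Dict, Sequence


def agreement_report(
    llm_scores: Sequence[int],
    human_scores: Sequence[int],
    tolerance: int = 0,
) -> Dict[str, float]:
    """Compare LLM and human scores given to the same responses.

    agreement_rate is the fraction of items whose scores differ by at most
    `tolerance`; mean_abs_diff is the average absolute score gap.
    """
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(a - b) for a, b in zip(llm_scores, human_scores)]
    within = sum(1 for d in diffs if d <= tolerance)
    return {
        "agreement_rate": within / len(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
    }


# Invented toy scores on a 1-5 scale, purely for illustration.
llm = [5, 4, 2, 5, 3]
human = [5, 4, 3, 5, 1]
report = agreement_report(llm, human)
loose_report = agreement_report(llm, human, tolerance=1)
print(report, loose_report)
```

Exact agreement is a strict bar for graded scores, so the `tolerance` parameter lets you also count near misses. For a more careful study you would likely want a chance-corrected statistic as well.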