
Joe here. It's difficult to evaluate the natural language responses that come from LLM applications: there are no hard metrics for measuring performance like there are in, say, supervised machine learning tasks. For RAG, you have the response to evaluate as well as the retrieved context chunks. At Tonic, we found that using gpt-4 as an evaluator to measure the quality of RAG responses and the relevance of the context chunks gave similar results to having human evaluators do the same task. Some research also finds that using LLMs as evaluators for natural language tasks gives results similar to those of human evaluators: https://arxiv.org/abs/2306.05685.

As for whether using gpt-4 is a safe approach, the best you could ask for is that gpt-4's evaluations match those of human evaluators, and that's what both we and this research have found.
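
For anyone who wants to run the same sanity check on their own data, here's a rough sketch of how you might compare LLM-judge scores against human scores on the same set of responses. This is not our actual evaluation code: the function names, the 1-5 scale, and the scores are all made up for illustration, and it assumes you've already collected one score per response from each kind of evaluator.

```python
from math import sqrt


def agreement_rate(llm_scores, human_scores, tolerance=0):
    """Fraction of items where the two scores differ by at most `tolerance`."""
    if len(llm_scores) != len(human_scores) or not llm_scores:
        raise ValueError("score lists must be non-empty and the same length")
    matches = sum(
        abs(a - b) <= tolerance for a, b in zip(llm_scores, human_scores)
    )
    return matches / len(llm_scores)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores on an assumed 1-5 scale, one per RAG response.
# These numbers are invented for the example, not real results.
llm_judge_scores = [5, 4, 2, 5, 1, 3]
human_scores = [5, 4, 3, 5, 1, 3]

exact = agreement_rate(llm_judge_scores, human_scores)
within_one = agreement_rate(llm_judge_scores, human_scores, tolerance=1)
corr = pearson(llm_judge_scores, human_scores)
print(f"exact agreement: {exact:.2f}")
print(f"agreement within 1 point: {within_one:.2f}")
print(f"pearson correlation: {corr:.2f}")
```

Exact agreement is a strict bar on an ordinal scale, so it's worth looking at within-one-point agreement and correlation alongside it. A rank-based measure or Cohen's kappa would also be reasonable choices here.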