
Joe here. This is a good question, and it's always a challenge when you use a test set to measure the performance of an LLM/AI/ML application. The answer is to build the test dataset from questions asked by real users of your app. That is the best way to get a representative sample of questions to test your RAG application with. If you do this, though, your test dataset will consist of questions without reference "correct" answers. In that case, you can still use tvalmetrics to evaluate the responses by using the metrics in tvalmetrics that do not rely on reference "correct" answers.

tvalmetrics introduces 6 RAG metrics: answer similarity, retrieval precision, augmentation precision, augmentation accuracy, answer consistency, and retrieval k-recall. Of these 6 metrics, only answer similarity requires reference answers, so you can use the other metrics to measure the performance of your RAG system when you have a test dataset of questions without reference "correct" answers.
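
To make that split concrete, here is a minimal sketch in plain Python. It does not call tvalmetrics at all: the `REQUIRES_REFERENCE` table and the `usable_metrics` helper are hypothetical names made up for illustration, and the table only encodes what is stated above (answer similarity is the one metric that needs reference answers).

```python
# Hypothetical lookup table for illustration; NOT part of the tvalmetrics API.
# It records, per metric, whether a reference "correct" answer is needed.
REQUIRES_REFERENCE = {
    "answer similarity": True,
    "retrieval precision": False,
    "augmentation precision": False,
    "augmentation accuracy": False,
    "answer consistency": False,
    "retrieval k-recall": False,
}


def usable_metrics(has_reference_answers: bool) -> list[str]:
    """Return the metric names that can be computed for a test dataset."""
    return [
        name
        for name, needs_reference in REQUIRES_REFERENCE.items()
        if has_reference_answers or not needs_reference
    ]


if __name__ == "__main__":
    # A dataset of real user questions with no reference answers.
    for name in usable_metrics(has_reference_answers=False):
        print(name)
```

With user-sourced questions and no reference answers, this leaves the five metrics other than answer similarity.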