by agautsc 968 days ago
If you build a dataset of questions and responses to test your RAG app with this metrics package, how do you know whether the distribution of those questions matches the distribution of questions you'll get from the app in production? Using a hand-made dataset of questions and responses could introduce a lot of bias into your RAG app.
1 comments

Joe here. This is a good question, and it is always a challenge when using a test set to measure the performance of an LLM/AI/ML application. The answer is to build the test dataset from questions asked by real users of your app; this is the best way to get a representative sample of questions to test your RAG application with. If you do this, though, your test dataset will consist of questions without reference "correct" answers. In this case, you can still use tvalmetrics to evaluate the responses by using the metrics in tvalmetrics that do not rely on reference "correct" answers.

tvalmetrics introduces 6 RAG metrics: answer similarity, retrieval precision, augmentation precision, augmentation accuracy, answer consistency, and retrieval k-recall. Of these 6 metrics, only answer similarity requires reference answers, so you can use the other metrics to measure the performance of your RAG system when you have a test dataset of questions without reference "correct" answers.
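
To make that concrete, here is a rough sketch of how you might decide which metrics to run on a given test set. It is plain Python and does not call tvalmetrics at all: the `Sample` class, the `REQUIRES_REFERENCE` table, and the `usable_metrics` function are names made up for illustration, and only the metric names and the "only answer similarity needs reference answers" rule come from the comment above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Metric names as listed above. The boolean marks whether the metric needs a
# reference ("correct") answer. This table is illustrative only and is not
# part of the tvalmetrics API.
REQUIRES_REFERENCE = {
    "answer similarity": True,
    "retrieval precision": False,
    "augmentation precision": False,
    "augmentation accuracy": False,
    "answer consistency": False,
    "retrieval k-recall": False,
}


@dataclass
class Sample:
    """Hypothetical test-set entry: a question plus an optional reference answer."""
    question: str
    reference_answer: Optional[str] = None


def usable_metrics(samples: List[Sample]) -> List[str]:
    """Return the metrics that can be computed for every sample in the set."""
    # Reference-based metrics are usable only if every sample has a reference.
    has_refs = bool(samples) and all(
        s.reference_answer is not None for s in samples
    )
    return [
        name
        for name, needs_ref in REQUIRES_REFERENCE.items()
        if has_refs or not needs_ref
    ]


# Questions pulled from production logs usually have no reference answers.
production_set = [Sample("How do I reset my password?")]
print(usable_metrics(production_set))
```

With a production-derived set like the one above, the list contains the five metrics other than answer similarity; a hand-labeled set where every sample has a reference answer would get all six.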